Showing 1–32 of 32 results for author: Gaur, Y

Searching in archive eess.
  1. arXiv:2406.09569  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

    Authors: Frank Seide, Morrie Doulaty, Yangyang Shi, Yashesh Gaur, Junteng Jia, Chunyang Wu

    Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle continuous audio without explicit end-pointing. Speech ReaLLM is a special case of the more general ReaLLM ("real-time LLM") approach, also introduced here for the…

    Submitted 13 June, 2024; originally announced June 2024.

  2. arXiv:2311.02248  [pdf, other]

    cs.CL cs.AI eess.AS

    COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning

    Authors: Jing Pan, Jian Wu, Yashesh Gaur, Sunit Sivasankaran, Zhuo Chen, Shujie Liu, Jinyu Li

    Abstract: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. Using GPT-3.5, we generate Speech Comprehension Test Question-Answer (SQA) pairs from speech transcriptions for supervised instruction tuning. With under 30 million trainable parameters…

    Submitted 14 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

  3. arXiv:2310.14806  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

    Authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur

    Abstract: The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficiencies in c…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  4. arXiv:2307.03917  [pdf, other]

    eess.AS cs.CL cs.SD

    On decoder-only architecture for speech-to-text and large language model integration

    Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

    Abstract: Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA,…

    Submitted 2 October, 2023; v1 submitted 8 July, 2023; originally announced July 2023.

  5. arXiv:2307.03354  [pdf, other]

    cs.CL cs.SD eess.AS

    Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

    Authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Jinyu Li, Yashesh Gaur

    Abstract: In real-world applications, users often require both translations and transcriptions of speech to enhance their comprehension, particularly in streaming scenarios where incremental generation is necessary. This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. To produce ASR and…

    Submitted 2 October, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

    Comments: Accepted at ASRU 2023

  6. arXiv:2305.16107  [pdf, other]

    cs.CL cs.SD eess.AS

    VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

    Authors: Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

    Abstract: Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a con…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Working in progress

  7. arXiv:2211.02809  [pdf, other]

    cs.CL cs.SD eess.AS

    LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers

    Authors: Peidong Wang, Eric Sun, Jian Xue, Yu Wu, Long Zhou, Yashesh Gaur, Shujie Liu, Jinyu Li

    Abstract: Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure. It is thus possible to use a single transducer model to perform both tasks. In real-world applications, such joint ASR and ST models may need to be streaming and do not require source language identification (i.e. language-agnostic). In this paper, we propose LAMASSU, a streaming…

    Submitted 19 October, 2023; v1 submitted 5 November, 2022; originally announced November 2022.

    Comments: INTERSPEECH 2023

  8. arXiv:2210.08665  [pdf, other]

    eess.AS cs.SD

    Acoustic-aware Non-autoregressive Spell Correction with Mask Sample Decoding

    Authors: Ruchao Fan, Guoli Ye, Yashesh Gaur, Jinyu Li

    Abstract: Masked language model (MLM) has been widely used for understanding tasks, e.g. BERT. Recently, MLM has also been used for generation tasks. The most popular one in speech is using Mask-CTC for non-autoregressive speech recognition. In this paper, we take one step further, and explore the possibility of using MLM as a non-autoregressive spell correction (SC) model for transformer-transducer (TT), d…

    Submitted 16 October, 2022; originally announced October 2022.

  9. arXiv:2210.08603  [pdf, other]

    eess.AS cs.SD

    CTCBERT: Advancing Hidden-unit BERT with CTC Objectives

    Authors: Ruchao Fan, Yiming Wang, Yashesh Gaur, Jinyu Li

    Abstract: In this work, we present a simple but effective method, CTCBERT, for advancing hidden-unit BERT (HuBERT). HuBERT applies a frame-level cross-entropy (CE) loss, which is similar to most acoustic model training. However, CTCBERT performs the model training with the Connectionist Temporal Classification (CTC) objective after removing duplicated IDs in each masked region. The idea stems from the obser…

    Submitted 28 April, 2023; v1 submitted 16 October, 2022; originally announced October 2022.

    Comments: Accepted to ICASSP2023

  10. arXiv:2204.05352  [pdf, other]

    cs.CL eess.AS

    Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

    Authors: Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

    Abstract: Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly. Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduc…

    Submitted 1 July, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: The paper was submitted to Interspeech 2022

  11. arXiv:2203.16685  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities,…

    Submitted 14 July, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted for presentation at Interspeech 2022

  12. arXiv:2202.00842  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Multi-Talker ASR with Token-Level Serialized Output Training

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their em…

    Submitted 14 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: 6 pages, 1 figure, 7 tables, v2: minor fixes, v3: Appendix D has been added, v4: citation to [27] has been added, v5: citations to [28][29][30] have been added with minor fixes, short version accepted for presentation at Interspeech 2022

  13. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, page 3775-3779

  14. arXiv:2110.05354  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

    Authors: Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

    Abstract: Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an…

    Submitted 26 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: 5 pages, in Interspeech 2022

    Journal ref: Interspeech 2022, Incheon, Korea

  15. arXiv:2110.03151  [pdf, other]

    eess.AS cs.CL cs.SD

    Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents Transcribe-to-Diarize, a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR). The E2E SA-ASR is a joint model that was recently proposed for speaker counting, multi-talker speech recognition, and speaker identification from monaural audio that contains overlapping speech. Although the E2E SA-ASR mode…

    Submitted 21 January, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: To appear in ICASSP 2022; System labels (SC and VBx) in Table 1 have been fixed

  16. arXiv:2109.08555  [pdf, other]

    eess.AS cs.SD

    Continuous Streaming Multi-Talker ASR with Dual-path Transducers

    Authors: Desh Raj, Liang Lu, Zhuo Chen, Yashesh Gaur, Jinyu Li

    Abstract: Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we…

    Submitted 22 January, 2022; v1 submitted 17 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at IEEE ICASSP 2022

  17. arXiv:2107.02852  [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

    Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In th…

    Submitted 17 September, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

    Comments: To appear in ASRU 2021

  18. arXiv:2104.02128  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speaker-Attributed ASR with Transformer

    Authors: Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture that was previously designed based on a long short-term memory (LSTM)-based attention encoder decoder by applying transformer…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  19. arXiv:2103.16776  [pdf, other]

    eess.AS cs.CL cs.SD

    Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

    Authors: Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR). While various approaches have been proposed, all previous studies on the monaural overlapped speech recognition problem were based on either simulation data or small-scale real data. In this paper, we extensively invest…

    Submitted 12 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  20. arXiv:2102.01380  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

    Authors: Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

    Abstract: The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method. In this method, the internal LM score is subtracted from the score obtained by interpolating the E2E score with the external LM score, during inference. To improve the ILME-based…

    Submitted 22 April, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: 5 pages, ICASSP 2021

    Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada

  21. arXiv:2101.01853  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

    Authors: Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

    Abstract: An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch bet…

    Submitted 5 January, 2021; originally announced January 2021.

    Comments: Submitted to ICASSP 2021

  22. arXiv:2011.04084  [pdf, other]

    eess.AS cs.SD eess.IV

    Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

    Authors: Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

    Abstract: In this study, we try to address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model. Firstly, we propose a multi-stream attention architecture to leverage signals fro…

    Submitted 8 November, 2020; originally announced November 2020.

    Comments: Accepted at SLT 2021

  23. arXiv:2011.03110  [pdf, other]

    eess.AS cs.SD

    Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription

    Authors: Xiaofei Wang, Naoyuki Kanda, Yashesh Gaur, Zhuo Chen, Zhong Meng, Takuya Yoshioka

    Abstract: Joint optimization of multi-channel front-end and automatic speech recognition (ASR) has attracted much interest. While promising results have been reported for various tasks, past studies on its meeting transcription application were limited to small scale experiments. It is still unclear whether such a joint framework can be beneficial for a more practical setup where a massive amount of single…

    Submitted 25 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted to SLT2021

  24. arXiv:2011.02921  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

    Abstract: Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. In the previous study, the model parameters were trained based on the speaker-attributed maximum mutual information (SA-MMI) criterion, with which the joint posterior probability f…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021. arXiv admin note: text overlap with arXiv:2006.10930, arXiv:2008.04546

  25. arXiv:2011.01991  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

    Authors: Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

    Abstract: The external language models (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR) which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: 8 pages, 2 figures, SLT 2021

    Journal ref: 2021 IEEE Spoken Language Technology Workshop (SLT)

  26. arXiv:2008.04546  [pdf, other]

    eess.AS cs.CL cs.SD

    Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

    Authors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. It showed promising results for simulated speech mixtures consisting of various numbers of speakers. However, the model required prior knowledge of speaker profiles to perform sp…

    Submitted 11 August, 2020; originally announced August 2020.

  27. arXiv:2006.10930  [pdf, other]

    eess.AS cs.CL cs.SD

    Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

    Authors: Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

    Abstract: We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT…

    Submitted 8 August, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted to INTERSPEECH 2020

  28. arXiv:2005.14327  [pdf, ps, other]

    eess.AS cs.CL

    On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

    Authors: Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

    Abstract: Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non…

    Submitted 29 July, 2020; v1 submitted 28 May, 2020; originally announced May 2020.

    Comments: Accepted by Interspeech 2020

  29. arXiv:2003.12687  [pdf, other]

    cs.CL cs.SD eess.AS

    Serialized Output Training for End-to-End Overlapped Speech Recognition

    Authors: Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

    Abstract: This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers as with the permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and…

    Submitted 8 August, 2020; v1 submitted 27 March, 2020; originally announced March 2020.

    Comments: Accepted to INTERSPEECH 2020

  30. arXiv:2001.01798  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

    Authors: Zhong Meng, Jinyu Li, Yashesh Gaur, Yifan Gong

    Abstract: Teacher-student (T/S) has shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend the T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: teacher's token posteriors as soft labels and one-best predictions as decoder g…

    Submitted 6 January, 2020; originally announced January 2020.

    Comments: 8 pages, 2 figures, ASRU 2019

    Journal ref: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore

  31. arXiv:2001.01795  [pdf, other]

    eess.AS cs.CL cs.LG cs.NE cs.SD

    Character-Aware Attention-Based End-to-End Speech Recognition

    Authors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

    Abstract: Predicting words and subword units (WSUs) as the output has shown to be effective for the attention-based encoder-decoder (AED) model in end-to-end speech recognition. However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion. Little effort has been made to explicitly mod…

    Submitted 6 January, 2020; originally announced January 2020.

    Comments: 7 pages, 3 figures, ASRU 2019

    Journal ref: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore

  32. arXiv:1911.03762  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Speaker Adaptation for Attention-Based End-to-End Speech Recognition

    Authors: Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

    Abstract: We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the spe…

    Submitted 9 November, 2019; originally announced November 2019.

    Comments: 5 pages, 3 figures, Interspeech 2019

    Journal ref: Interspeech 2019, Graz, Austria