Showing 1–3 of 3 results for author: Doutre, T

  1. arXiv:2110.03841

    eess.AS cs.CL

    Input Length Matters: Improving RNN-T and MWER Training for Long-form Telephony Speech Recognition

    Authors: Zhiyun Lu, Yanwei Pan, Thibault Doutre, Parisa Haghani, Liangliang Cao, Rohit Prabhavalkar, Chao Zhang, Trevor Strohman

    Abstract: End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. In this paper we study the effect of training utterance length on the word e…

    Submitted 1 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: submitted to INTERSPEECH 2022

  2. arXiv:2104.14346

    cs.CL cs.SD eess.AS

    Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

    Authors: Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real-time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming mode…

    Submitted 25 April, 2021; originally announced April 2021.

  3. arXiv:2010.12096

    cs.SD cs.CL eess.AS

    Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

    Authors: Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

    Abstract: Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a nov…

    Submitted 21 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.