An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

Moritz, Niko; Seide, Frank; Le, Duc; Mahadeokar, Jay; Fuegen, Christian

arXiv:2204.08858 (eess)

[Submitted on 19 Apr 2022 (v1), last revised 21 Oct 2022 (this version, v2)]

Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types sit the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, they avoid the runaway hallucination that RNN-T can suffer from, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, the MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way: by regularizing the training via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T. This is demonstrated for LibriSpeech and for a large-scale in-house data set.
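The difference between the two lattices can be illustrated with a minimal sketch of the forward recursion. It is not taken from the paper: it assumes the commonly described formulation in which a standard RNN-T label emission stays on the same frame, while a monotonic transducer consumes one score (blank or label) per frame. The function names and the toy uniform score tables `log_blank` and `log_label` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(log_blank, log_label):
    """Assumed standard RNN-T lattice: blank advances t, a label advances u only."""
    T, U = len(log_blank), len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrived by emitting blank at (t-1, u)
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrived by emitting a label at (t, u-1), same frame
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # A final blank leaves the lattice from the last node.
    return alpha[T - 1][U] + log_blank[T - 1][U]

def mono_rnnt_log_likelihood(log_blank, log_label):
    """Assumed monotonic lattice: every frame consumes exactly one score,
    either blank (u unchanged) or a label (u advances), and t always advances."""
    T, U = len(log_blank), len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T + 1)]
    alpha[0][0] = 0.0
    for t in range(1, T + 1):
        for u in range(U + 1):
            stay = alpha[t - 1][u] + log_blank[t - 1][u]
            emit = alpha[t - 1][u - 1] + log_label[t - 1][u - 1] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(stay, emit)
    return alpha[T][U]

# Toy example: T = 3 frames, U = 2 labels, every score set to log(0.5).
T, U = 3, 2
log_half = math.log(0.5)
log_blank = [[log_half] * (U + 1) for _ in range(T)]
log_label = [[log_half] * U for _ in range(T)]

rnnt = math.exp(rnnt_log_likelihood(log_blank, log_label))
mono = math.exp(mono_rnnt_log_likelihood(log_blank, log_label))

# With only one frame, the monotonic lattice cannot emit two labels.
too_long = mono_rnnt_log_likelihood([[log_half] * 3], [[log_half] * 2])
```

In this toy setting the monotonic lattice has 3 alignments of 3 scores each, while the standard lattice has 6 alignments of 5 scores each. Under the stated assumption, the monotonic variant cannot emit more labels than there are frames, which is one way to read the abstract's point about runaway emission.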

Comments:	Accepted to SLT 2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2204.08858 [eess.AS]
	(or arXiv:2204.08858v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2204.08858

Submission history

From: Niko Moritz
[v1] Tue, 19 Apr 2022 12:51:30 UTC (107 KB)
[v2] Fri, 21 Oct 2022 19:00:45 UTC (96 KB)
