License: arXiv.org perpetual non-exclusive license
arXiv:2309.08436v2 [eess.AS] 17 Jan 2024

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

Abstract

We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on LibriSpeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.

Index Terms—  Chunked attention models, transducer, streamable

1 Introduction & Related Work

Among the potential streaming models are the traditional HMM [1], CTC [2], and more recently the transducer [3]. While many streamable attention-based encoder-decoder (AED) models have been proposed [4, 5, 6, 7, 8], they are too complicated, rely on too many heuristics, and are not robust enough in comparison to the transducer [9].

Here we show how a seemingly very simple modification makes the AED model streamable and turns out to be very robust and competitive, specifically on long-form speech, in contrast to many other AED and transducer models [10, 11, 12, 8, 9]. Interestingly, the small modification leads to an equivalence to transducer models, and we study the exact modeling differences.

We use chunking as the core mechanism for both the encoder and cross-attention in the decoder. This means that we extract chunks (windows) of fixed width at a fixed step size (stride). The static step size implies that we have a variable number of labels per chunk. The static sizes in the encoder also allow for efficient processing in training and recognition, which is more efficient than causal self-attention and also performs better.

Related to chunkwise processing is the operation on segments with variable boundaries in segmental attention models [8], or on fixed-size windows at variable positions [13]. Having variable positions or segment boundaries allows using a single label per window or segment. In contrast, using fixed-size chunks at fixed positions implies that we have a variable number of labels per chunk. Further, we can use the same chunking in the encoder, with the big advantage that we can parallelize the training computation in the encoder independent of the alignment.

Similar chunking in the decoder has been done in [14, 15, 16, 17, 18, 19] and similar chunking in the encoder has been done in [20, 21, 22, 23, 7, 24, 25, 26, 27, 28, 29]. There are also other approaches to make self-attention in the encoder streamable [30, 31, 9].

2 Global AED Model

Our baseline is the standard global attention-based encoder-decoder (AED) model [32] adapted for speech recognition [4, 33, 34, 35]. We use a Conformer-based encoder [36]. The model operates on a sequence of audio feature frames $x_{1:T} \in \mathbb{R}^{T \times D}$ (10ms resolution) of length $T$ as input and encodes it as a sequence

$$h_{1:T'} = \operatorname{GlobalEncoder}(x_{1:T}) \in \mathbb{R}^{T' \times D_{\textrm{enc}}}$$

of length $T'$ and encoder feature dimension $D_{\textrm{enc}}$. The encoder has a convolutional frontend with striding in time which downsamples the input by a factor of 6. Thus, the encoder outputs a frame every 60ms and $T' = \lceil \frac{T}{6} \rceil$.

The probability of the output label sequence $a_{1:S} \in \mathcal{A}^S$ given the encoder output sequence $h_{1:T'}$ is defined as

$$p(a_{1:S} \mid h_{1:T'}) = \prod_{s=1}^{S} p(a_s \mid a_{1:s-1}, h_{1:T'}).$$

We have $a_S = \mathrm{EOS}$ to mark the end of the sequence (EOS), which implicitly models the probability of the sequence length. This part of the model is called the decoder. The decoder uses global attention on $h_{1:T'}$ per output step $s$. The sole difference between the global decoder and the chunked decoder is global attention vs. chunked attention. The decoder is defined below.

3 Chunked AED Model

Fig. 1: Chunking on input frames $x_{1:T}$ with chunk center size $T_w$, right context $T_r$ and stride $T_s$, where we have $T_s = T_w$.

As visualized in Fig. 1, we extract strided windows called ‘chunks’ with chunk size $T_w$ and stride $T_s$. For input $x_{1:T} = (x_1, \dots, x_T) \in \mathbb{R}^{T \times D}$, we get the chunks $x'_{1:K,1:T_w} \in \mathbb{R}^{K \times T_w \times D}$ with $x'_{k,1:T_w} \in \mathbb{R}^{T_w \times D}$ for chunk index $k \in \{1, \dots, K\}$ with $K = \lceil \frac{T}{T_s} \rceil$, where

$$x'_{k,t} = x_{(k-1) \cdot T_s + t} \in \mathbb{R}^{D}, \quad t \in \{1, \dots, T_w\}.$$

Additionally, we might extend the chunk size by $T_r$ more frames to get some extended right context.
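The chunk extraction can be illustrated with a minimal pure-Python sketch. The function name `extract_chunks`, the 0-based indexing, and the padding of positions beyond the end of the input with `pad` are assumptions made for illustration; the text above does not specify how the last, incomplete chunk is handled.

```python
import math

def extract_chunks(x, chunk_size, stride, right_context=0, pad=None):
    """Split frames x[0..T-1] into K = ceil(T / stride) strided chunks.

    Chunk k (0-based) covers frames k*stride .. k*stride + chunk_size + right_context - 1.
    Positions past the end of the input are filled with `pad` (an assumption).
    """
    num_frames = len(x)
    num_chunks = math.ceil(num_frames / stride)
    width = chunk_size + right_context
    chunks = []
    for k in range(num_chunks):
        start = k * stride
        chunks.append([x[t] if t < num_frames else pad
                       for t in range(start, start + width)])
    return chunks

frames = list(range(1, 11))  # 10 toy "frames", labelled 1..10
chunks = extract_chunks(frames, chunk_size=4, stride=4, right_context=2)
for k, chunk in enumerate(chunks, start=1):
    print(k, chunk)
```

With a stride equal to the chunk size, the chunk centers tile the input without overlap, and only the right-context frames are shared between neighbouring chunks.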

For the streaming model, the chunking is applied directly on the input (e.g. log mel features every 10ms), and then a variant of the Conformer encoder works on the chunks $x'_{1:K,1:T_w}$ and calculates the encoder output

$$h'_{1:K,1:T'_w} = \operatorname{ChunkedEncoder}(x'_{1:K,1:T_w}) \in \mathbb{R}^{K \times T'_w \times D_{\textrm{enc}}}$$

where $T'_w = \lceil \frac{T_w}{6} \rceil$.

For comparison, we also use a standard Conformer with global attention applied on the whole input

$$h_{1:T'} = \operatorname{GlobalEncoder}(x_{1:T}) \in \mathbb{R}^{T' \times D_{\textrm{enc}}}$$

and apply chunking on the encoder output $h_{1:T'}$ such that we get the chunked encoder output $h'_{1:K,1:T'_w}$.

3.1 Streamable Chunked Encoder

Fig. 2: Chunked self-attention in the encoder.

Our starting point is the standard Conformer, operating on chunks instead of the whole sequence, i.e. operating on $x'_{k,1:T_w}$ for every chunk index $k$. The self-attention is calculated per chunk, i.e. for both the chunk center and right context frames. It attends to all frames within the chunk and additionally to the previous chunk, as can be seen in Fig. 2. Thus it is non-causal within the chunk, just like the convolution. The decoder cross-attention afterwards only accesses the chunk center frames, so we expect the chunk center to cover the labels for this chunk. The future lookahead via the right context frames does not accumulate over multiple layers. The history context, in contrast, does accumulate over multiple layers, because every layer accesses the previous chunk. This also explains why we do not need any additional left context frames within the chunk.

In training, we can calculate all chunks in parallel, and the self-attention calculation per chunk is more efficient compared to the global self-attention. We only get a small overhead due to the overlap of the chunk via the right context frames.

Note that this avoids look-ahead context leaking in a way that is mathematically equivalent to the Emformer [24] and to dual causal/non-causal self-attention [31].
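To make the attention pattern more concrete, the following sketch lists which key positions the queries of one chunk may attend to. It is an illustration only: the helper names are hypothetical, chunks are 0-based, and it assumes that only the center frames of previous chunks are carried over as history, which the text above does not state explicitly.

```python
def attention_keys(k, chunk_size, right_context, carry_over_chunks=1):
    """Return the (chunk, position) key slots visible to the queries of chunk k."""
    keys = []
    # History: center frames of up to `carry_over_chunks` previous chunks (assumption).
    for prev in range(max(0, k - carry_over_chunks), k):
        keys.extend((prev, t) for t in range(chunk_size))
    # Current chunk: center and right-context frames, non-causal within the chunk.
    keys.extend((k, t) for t in range(chunk_size + right_context))
    return keys

def earliest_reachable_chunk(k, num_layers, carry_over_chunks=1):
    """History accumulates: each layer reaches `carry_over_chunks` chunks further back."""
    return max(0, k - num_layers * carry_over_chunks)

print(attention_keys(1, chunk_size=2, right_context=1))
print(earliest_reachable_chunk(k=20, num_layers=12))
```

The lookahead never grows beyond `right_context` frames, because the keys of a chunk never include frames past its own right context, whereas the reachable history grows linearly with the number of layers.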

3.2 Streamable Chunked Decoder

Fig. 3: Possible transition sequences $a_{1:K+N}$ for non-EOC label sequence $ABC$ with length $N = 3$ and $K = 4$ chunks, where $\varepsilon$ is the end-of-chunk (EOC) symbol.

In the output vocabulary $\mathcal{A}$, we replace the $\mathrm{EOS}$ by a new special end-of-chunk (EOC) symbol $\mathrm{EOC}$. We start with the first chunk ($k = 1$), and once we get $\mathrm{EOC}$, we advance to the next chunk ($k' = k + 1$). The decoder is exactly like in the global AED model, except that the global attention is replaced by attention on the current chunk. The possible transitions can be seen in Fig. 3.
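The set of possible transition sequences can be enumerated with a short sketch. The function name and the `"<EOC>"` token string are illustrative choices. The sketch relies on the fact that every valid sequence contains exactly $K$ EOC symbols and ends with EOC, since the EOC in the last chunk terminates the sequence.

```python
from itertools import combinations

EOC = "<EOC>"

def transition_sequences(labels, num_chunks):
    """Enumerate all sequences a_{1:K+N} that emit `labels` in order over K chunks."""
    n = len(labels)
    total = n + num_chunks
    sequences = []
    # Place the N labels among the first K+N-1 slots; the last slot is always EOC.
    for label_positions in combinations(range(total - 1), n):
        seq = [EOC] * total
        for pos, label in zip(label_positions, labels):
            seq[pos] = label
        sequences.append(seq)
    return sequences

seqs = transition_sequences(["A", "B", "C"], num_chunks=4)
print(len(seqs))  # 20
print(seqs[0])
print(seqs[-1])
```

For $N = 3$ labels and $K = 4$ chunks this yields $\binom{K+N-1}{N} = 20$ sequences, ranging from all labels in the first chunk to all labels in the last chunk.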

The probability to emit the next label $a_s \in \mathcal{A}$ is estimated using an LSTM [37] with zoneout [38] and MLP cross-attention [32] to the current chunk of the encoder:

$$\begin{aligned}
p(a_s \mid \dots) &= (\operatorname{softmax} \circ \operatorname{Linear} \circ \operatorname{maxout} \circ \operatorname{Linear})\big(g_s, c_s\big) \\
g_s &= \operatorname{ZoneoutLSTM}(c_{1:s-1}, a_{1:s-1}) \\
c_s &= \sum_{t=1}^{T'_w} \alpha_{s,t} \cdot h'_{k_s,t} \in \mathbb{R}^{D_{\textrm{enc}}} \\
\alpha_{s,t} &= \frac{\exp(e_{s,t})}{\sum_{\tau=1}^{T'_w} \exp(e_{s,\tau})} \in \mathbb{R}, \quad t \in \{1, \dots, T'_w\} \\
e_{s,t} &= (\operatorname{Linear} \circ \tanh \circ \operatorname{Linear})\big(g_s, h'_{k_s,t}\big) \in \mathbb{R},
\end{aligned}$$

and the current chunk index $k_s$ is defined as

$$k_s = \begin{cases} k_{s-1} + 1, & a_{s-1} = \mathrm{EOC} \\ k_{s-1}, & a_{s-1} \neq \mathrm{EOC} \end{cases}$$

and initially $k_1 = 1$. The sequence ends when we reach $k_s = K$ and $a_s = \mathrm{EOC}$. The attention weights here are only calculated inside the current chunk. Further, we do not use attention weight feedback. Otherwise the model is exactly the same as the global attention decoder, to allow for direct comparisons, and also to import model parameters.
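The chunk index recursion and the attention restricted to one chunk can be sketched in plain Python. This is a simplified illustration, not the actual decoder: the energies $e_{s,t}$ are passed in directly instead of being computed by the MLP, and the function names and the `"<EOC>"` token are hypothetical.

```python
import math

EOC = "<EOC>"

def chunk_indices(labels):
    """Return k_s for s = 1..len(labels): start at 1, advance after each EOC."""
    ks, k = [], 1
    for a in labels:
        ks.append(k)
        if a == EOC:
            k += 1
    return ks

def attend(energies, chunk):
    """Softmax over the energies of one chunk, then the weighted sum of its frames."""
    m = max(energies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    alphas = [v / z for v in exps]
    dim = len(chunk[0])
    context = [sum(a * frame[d] for a, frame in zip(alphas, chunk))
               for d in range(dim)]
    return alphas, context

ks = chunk_indices(["A", EOC, EOC, "B", "C", EOC, EOC])
print(ks)  # [1, 1, 2, 3, 3, 3, 4]

alphas, context = attend([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
print(alphas, context)  # [0.5, 0.5] [2.0, 3.0]
```

Because the softmax normalizes only over the $T'_w$ frames of chunk $k_s$, the cost of one decoder step does not depend on the total sequence length.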

We realize that the chunked decoder is equivalent to a transducer model [3, 12], where $\mathrm{EOC}$ behaves exactly like the blank symbol, and we iterate over chunks instead of frames, which is like a higher downsampling rate. A similar observation for a similar model has been made in [16]. The main difference is the cross-attention and the decoder LSTM dependence on the encoder output. Note that this is a different kind of equivalence compared to [39], where a segmental model is rewritten in a framewise manner.

3.3 Training

We create a chunkwise alignment from an existing framewise alignment, then add the EOC labels, and train with labelwise cross-entropy, just like the standard AED training. This is different to the standard transducer training, which performs a full sum over all alignment paths. The standard transducer training criterion cannot be applied easily here due to the alignment label dependencies [12, 9].
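The conversion from a framewise to a chunkwise alignment can be sketched as follows. The sketch assumes a framewise alignment over encoder frames in which every label occupies exactly one frame and all other frames are blank, and it assigns each label to the chunk that contains its frame. The function name and the token strings are hypothetical.

```python
import math

BLANK = "<blank>"
EOC = "<EOC>"

def chunkwise_alignment(frame_alignment, chunk_size):
    """Collect the non-blank labels of every chunk and close each chunk with EOC."""
    num_chunks = math.ceil(len(frame_alignment) / chunk_size)
    targets = []
    for k in range(num_chunks):
        frames = frame_alignment[k * chunk_size:(k + 1) * chunk_size]
        targets.extend(a for a in frames if a != BLANK)
        targets.append(EOC)
    return targets

alignment = [BLANK, "A", BLANK, BLANK, "B", "C", BLANK]
targets = chunkwise_alignment(alignment, chunk_size=3)
print(targets)  # ['A', '<EOC>', 'B', 'C', '<EOC>', '<EOC>']
```

The resulting target sequence has length $N + K$ and can be used directly for labelwise cross-entropy training.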

3.4 Beam Search

We perform alignment-synchronous search, meaning that in each step, all hypotheses have the same number of labels, including $\mathrm{EOC}$. It is exactly the same as the alignment-synchronous transducer search [40, 12].

For the very best results, we make use of an external language model (LM) and perform internal language model (ILM) prior correction [41]. Note that the chunked AED model has the EOC label (blank label) instead of the EOS label. We use the scores

$$P(a_s \mid \dots) = \begin{cases}
P_{\textrm{AED}}^{\alpha}(a_s \mid \dots) \cdot P_{\textrm{LM}}^{\beta}(a_s \mid \dots) \cdot P_{\textrm{ILM}}^{-\lambda}(a_s \mid \dots), & a_s \neq \mathrm{EOC} \\
P_{\textrm{AED}}(\mathrm{EOC} \mid \dots), & a_s = \mathrm{EOC},\ k < K \\
P_{\textrm{AED}}^{\alpha}(\mathrm{EOC} \mid \dots) \cdot P_{\textrm{LM}}^{\beta}(\mathrm{EOS} \mid \dots) \cdot P_{\textrm{ILM}}^{-\lambda}(\mathrm{EOS} \mid \dots), & a_s = \mathrm{EOC},\ k = K
\end{cases}$$

where $\alpha$, $\beta$, and $\lambda$ are tuned scales and we set $\alpha = 1 - \beta$. $P_{\textrm{LM}}(\mathrm{EOS} \mid \dots)$ and $P_{\textrm{ILM}}(\mathrm{EOS} \mid \dots)$ are set to 0 for $k < K$, and the distributions are then renormalized. This is very similar to the EOS handling for transducer models with ILM prior correction [42], except that our ILM also has EOS and the renormalization. We use the Mini-LSTM ILM method [41], which is trained on the training transcription labels with $\mathrm{EOS}$, as in the LM training data.
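The three cases of the combined score can be sketched in log space. This is an illustration under stated assumptions: the distributions are given as plain dictionaries of log probabilities, the function names are hypothetical, and the renormalization without EOS is applied to the LM and ILM for all non-EOC labels while $k < K$, which is one reading of the description above.

```python
import math

EOC = "<EOC>"
EOS = "<EOS>"

def renormalize_without_eos(log_probs):
    """Set P(EOS) to 0 and renormalize the remaining distribution (log space)."""
    rest = {a: lp for a, lp in log_probs.items() if a != EOS}
    log_z = math.log(sum(math.exp(lp) for lp in rest.values()))
    return {a: lp - log_z for a, lp in rest.items()}

def combined_log_score(label, k, num_chunks, aed, lm, ilm, beta, lam):
    """Log of the combined score for `label` in chunk k (1-based) out of num_chunks."""
    alpha = 1.0 - beta
    if label != EOC:
        if k < num_chunks:  # EOS is impossible before the last chunk
            lm, ilm = renormalize_without_eos(lm), renormalize_without_eos(ilm)
        return alpha * aed[label] + beta * lm[label] - lam * ilm[label]
    if k < num_chunks:
        return aed[EOC]  # plain AED score, no LM or ILM contribution
    return alpha * aed[EOC] + beta * lm[EOS] - lam * ilm[EOS]

# Toy distributions over a vocabulary with a single real label "A".
aed = {"A": math.log(0.6), EOC: math.log(0.4)}
lm = {"A": math.log(0.5), EOS: math.log(0.5)}
ilm = {"A": math.log(0.5), EOS: math.log(0.5)}
print(combined_log_score("A", 1, 2, aed, lm, ilm, beta=0.5, lam=0.2))
print(combined_log_score(EOC, 2, 2, aed, lm, ilm, beta=0.5, lam=0.2))
```

Only the final EOC is scored like an EOS by the LM and ILM, because the LM vocabulary has no notion of chunk boundaries.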

4 Experiments

We conduct experiments on LibriSpeech 960h [43] and TED-LIUM-v2 200h [44] using BPE labels [45]. We use RETURNN [46] based on TensorFlow [47]. All code, including full recipes, is online at https://github.com/rwth-i6/returnn-experiments/tree/master/2023-chunked-aed.

We train the global AED model for 100 epochs using a single consumer GPU. We apply on-the-fly speed perturbation and SpecAugment [34]. The encoder consists of 12 Conformer layers with a model dimension of 512, and the decoder LSTM has a dimension of 1024. We use an auxiliary CTC loss [48] on top of the encoder output for better training convergence and for the alignments.

To train the chunked AED models, we extract time-synchronous alignments from the jointly trained CTC model with label loops disallowed. Then, we convert this alignment into a chunk-synchronous one and use that as the targets for cross-entropy training. We initialize all parameters using the best checkpoint of the global AED model and train for 15-30 epochs.

4.1 Chunked Decoder

Table 1: WERs [%], studying chunked decoder with different chunk sizes with no overlap when using global encoder. $\infty$ means global decoder. The frame rate of $h$ is 60 ms.

| Chunk size $T'_w$ | Chunk size (sec) | TED-v2 dev | TED-v2 test | LibriSpeech dev-other | LibriSpeech test-other |
|---|---|---|---|---|---|
| 1 | 0.06 | 7.5 | 7.3 | 5.8 | 6.0 |
| 5 | 0.3 | 7.3 | 7.1 | 5.7 | 5.9 |
| 10 | 0.6 | 7.3 | 6.9 | 5.7 | 5.7 |
| 25 | 1.5 | 7.4 | 6.9 | 5.6 | 5.7 |
| $\infty$ | | 7.4 | 6.9 | 5.6 | 5.7 |

First, we investigate the effect of chunking only in the decoder, i.e. chunking the output $h$ of the global encoder. Results on TED-LIUM-v2 and LibriSpeech are shown in Table 1. We observe that the chunked decoder matches the WERs of the global AED model on both test sets from a chunk size of 10 frames (0.6 sec) onwards, and that even a chunk size of a single frame degrades the WER by at most 0.4% absolute.

4.2 Chunked Encoder-Decoder

Table 2: For chunked AED, effect on WERs [%] for carry-over history context, center chunk size $T_w$, lookahead future context $T_r$. All sizes are in seconds.

| Carry-over | Chunk size | Lookahead | TED-v2 dev | TED-v2 test | LibriSpeech dev-other | LibriSpeech test-other |
|---|---|---|---|---|---|---|
| 2.4 | 0.6 | 0.3 | 8.2 | 7.6 | 7.2 | 7.4 |
| 2.4 | 0.6 | 0.6 | 7.9 | 7.4 | 6.8 | 6.8 |
| 2.4 | 0.6 | 0.9 | 7.7 | 7.1 | 6.6 | 6.7 |
| 0 | 1.2 | 0.3 | 8.6 | 8.0 | 7.1 | 7.0 |
| 1.2 | 1.2 | 0.3 | 7.9 | 7.3 | 6.8 | 6.8 |
| 2.4 | 1.2 | 0.3 | 7.7 | 7.3 | 6.7 | 6.7 |
| 3.6 | 1.2 | 0.3 | 7.7 | 7.3 | 6.7 | 6.7 |
| 2.4 | 1.2 | 0 | 10.2 | 9.7 | 7.8 | 7.8 |
| 2.4 | 1.2 | 0.6 | 7.8 | 7.2 | 6.5 | 6.6 |
| 2.4 | 1.2 | 0.9 | 7.5 | 7.1 | 6.2 | 6.3 |
| 3.0 | 1.5 | 0.3 | 7.7 | 7.3 | 6.3 | 6.3 |
| 3.6 | 1.8 | 0.3 | 7.5 | 7.1 | 6.2 | 6.2 |
| | $\infty$ | | 7.4 | 6.9 | 5.6 | 5.7 |

Table 2 shows WER results of the chunked AED model. We observe that carrying over left context yields improvement, where 2.4 seconds is enough. In addition, using future lookahead gives good improvements in all cases. The chunked AED model with a total chunk size and lookahead of 2.1 seconds achieves a WER of 7.1% and 6.2% on the TED-LIUM-v2 and LibriSpeech test sets respectively, a relative increase in WER of 4% and 9% compared to the global AED model.

4.3 Latency

Table 3: Word emit latency for chunked AED model on TED-LIUM-v2 dev dataset. All timing values are in seconds.

| Carry-over | Chunk size | Lookahead | Latency 50th percentile | Latency 95th percentile | Latency 99th percentile | WER [%] dev |
|---|---|---|---|---|---|---|
| 2.4 | 0.6 | 0.9 | 1.08 | 1.39 | 1.44 | 7.7 |
| 2.4 | 1.2 | 0.3 | 0.78 | 1.34 | 1.42 | 7.7 |
| 2.4 | 1.2 | 0.6 | 1.08 | 1.63 | 1.71 | 7.8 |
| 2.4 | 1.2 | 0.9 | 1.39 | 1.94 | 2.02 | 7.5 |
| 3.6 | 1.8 | 0.3 | 1.11 | 1.90 | 2.01 | 7.5 |

We compute the difference between the word end time from a GMM alignment and the chunk end time in which the word is emitted by the chunked AED model. Word emit latency measures can be found in Table 3. Lookahead seems to add more latency compared to using larger chunk sizes (rows: 1 vs 2, 4 vs 5).
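The latency measure can be sketched as follows. Several details here are assumptions rather than the paper's exact procedure: the chunk end time is taken to include the lookahead frames, the stride equals the chunk size, chunk indices are 1-based, and percentiles use the nearest-rank method. All function names and the toy values are hypothetical.

```python
import math

def chunk_end_time(k, chunk_size, lookahead):
    """End time (sec) of 1-based chunk k, including lookahead; stride = chunk_size."""
    return k * chunk_size + lookahead

def word_latencies(word_end_times, emit_chunks, chunk_size, lookahead):
    """Latency per word: chunk end time of the emitting chunk minus the word end time."""
    return [chunk_end_time(k, chunk_size, lookahead) - t_end
            for t_end, k in zip(word_end_times, emit_chunks)]

def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy example: word end times from a reference alignment and emitting chunk per word.
latencies = word_latencies([0.5, 1.1, 1.9, 2.6], [1, 1, 2, 3],
                           chunk_size=1.2, lookahead=0.3)
print([round(v, 2) for v in latencies])
print(round(percentile(latencies, 50), 2), round(percentile(latencies, 95), 2))
```

Under these assumptions, every word waits at least for the lookahead frames of its chunk, which is consistent with lookahead contributing directly to the latency.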

4.4 Long-Form Recognition

Table 4: WERs [%] of long-form speech recognition on TED-LIUM-v2 test dataset with $\mathcal{C}$ concatenated sequences.

| $\mathcal{C}$ | Sequence lengths (sec), Mean ± Std | Sequence lengths (sec), Min - Max | Global Enc., Gl. Dec. | Global Enc., Chunk Dec. | Ch. Enc., Chunk Dec. |
|---|---|---|---|---|---|
| 1 | 8.20 ± 4.30 | 0.35 - 32.55 | 6.9 | 6.9 | 7.3 |
| 2 | 23.10 ± 8.50 | 0.41 - 45.70 | 7.0 | 6.9 | 7.1 |
| 4 | 33.70 ± 11.90 | 0.41 - 70.70 | 9.2 | 7.0 | 7.0 |
| 8 | 65.95 ± 22.19 | 7.19 - 116.99 | 23.4 | 7.1 | 7.1 |
| 10 | 82.51 ± 26.87 | 15.67 - 142.08 | 34.2 | 7.1 | 7.0 |
| 20 | 160.14 ± 53.98 | 17.83 - 237.27 | 62.4 | 7.1 | 7.0 |

We investigate the generalization on long-form speech recognition. We conduct these experiments on TED-LIUM-v2 by concatenating $\mathcal{C}$ consecutive sequences from the same recording to create much longer sequences than what was seen in training. We compare the global AED baseline to a chunked AED model with left context carry-over 2.4 sec, chunk size 1.2 sec, lookahead 0.32 sec, and to a chunked decoder with global encoder. From the results in Table 4, we observe that the global AED becomes much worse on longer sequences, whereas the chunked AED model generalizes very well and even improves, which is probably because the decoder now has better LM context. This is also the case when only the decoder is chunked. The relative positional encoding in the encoder is probably helpful. The generalization is much better than that of other variants such as the segmental AED model [8], although that work uses an LSTM-based encoder.
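The construction of the long-form test data can be sketched as a simple grouping step. The data layout (tuples of recording id, audio samples, transcript, given in temporal order) and the function name are assumptions for illustration; leftover segments at the end of a recording form a shorter group in this sketch.

```python
def concatenate_consecutive(segments, group_size):
    """Merge up to `group_size` consecutive segments of the same recording.

    segments: list of (recording_id, audio_samples, transcript) in temporal order.
    """
    merged = []
    current_rec, audio, words, count = None, [], [], 0
    for rec_id, samples, text in segments:
        # Close the current group when it is full or the recording changes.
        if count == group_size or (count > 0 and rec_id != current_rec):
            merged.append((current_rec, audio, " ".join(words)))
            audio, words, count = [], [], 0
        current_rec = rec_id
        audio = audio + list(samples)
        words.append(text)
        count += 1
    if count > 0:
        merged.append((current_rec, audio, " ".join(words)))
    return merged

segments = [("talk1", [1, 2], "a b"), ("talk1", [3], "c"),
            ("talk1", [4], "d"), ("talk2", [5], "e")]
long_form = concatenate_consecutive(segments, group_size=2)
print(long_form)
```

Such shorter leftover groups would be one possible explanation for the small minimum sequence lengths in Table 4, although the paper does not state how the remainder is handled.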

4.5 Beam Size and Length Normalization

Table 5: Comparison of effect of beam sizes and length normalization between global AED and chunked AED models. WERs [%] on TED-LIUM-v2 test dataset.

| Length Norm. | Beam | Global | Chunked |
|---|---|---|---|
| (No influence) | 1 | 7.1 | 7.4 |
| Yes | 12 | 6.9 | 7.3 |
| Yes | 32 | 6.9 | 7.3 |
| Yes | 64 | 6.9 | 7.3 |
| No | 12 | 7.0 | 7.3 |
| No | 32 | 8.5 | 7.3 |
| No | 64 | 10.9 | 7.3 |

The global AED model suffers from the length bias problem [49]: because it has no explicit length modeling, it tends to prefer short hypotheses, especially as the beam size increases. In contrast, the chunked AED model, like a transducer, does not have the length bias issue, since the length is modeled by the EOC symbol. To verify this, we run experiments with different beam sizes and optional length normalization [50] for both the global AED and the chunked AED model on the TED-LIUM-v2 test dataset. Results are shown in Table 5. Without length normalization, the global AED model degrades considerably as the beam size grows, from 7.0% WER at beam size 12 to 10.9% at beam size 64, whereas the chunked AED model does not need such a heuristic and stays at 7.3%. Additionally, both models perform marginally worse with greedy recognition (beam size 1).
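To make the length bias concrete, the following toy sketch compares two finished hypotheses with and without length normalization. Dividing the summed log-probability by the number of labels is only one simple normalization variant, assumed here for illustration; the probabilities are made up and the function name is hypothetical.

```python
import math
from typing import List, Tuple

# A finished hypothesis: (label sequence, summed log-probability).
Hyp = Tuple[List[str], float]


def best_hypothesis(hyps: List[Hyp], length_norm: bool) -> Hyp:
    """Pick the best hypothesis, optionally normalizing the score by length."""
    def score(hyp: Hyp) -> float:
        labels, log_prob = hyp
        if length_norm:
            # Per-label average log-probability (assumed normalization variant).
            return log_prob / max(len(labels), 1)
        return log_prob
    return max(hyps, key=score)


# Toy scores: a truncated hypothesis with one label of probability 0.30,
# and a complete hypothesis with three labels of probability 0.6 each.
short: Hyp = (["the"], math.log(0.30))
full: Hyp = (["the", "cat", "sat"], 3 * math.log(0.6))

# Each additional label adds a negative log-probability, so the raw score
# favors the truncated hypothesis; the per-label score favors the full one.
raw_choice = best_hypothesis([short, full], length_norm=False)
norm_choice = best_hypothesis([short, full], length_norm=True)
```

A larger beam keeps more such short hypotheses alive until the end of the search, which is one way to read the degradation of the global AED model in Table 5 when normalization is disabled.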

4.6 External Language Model

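Table 6: WERs [%] with Transformer and LSTM language model integration on LibriSpeech dataset.

| Model | LM | ILM | dev-other | test-other |
|---|---|---|---|---|
| Global AED | - | - | 5.6 | 5.7 |
| | LSTM | No | 4.6 | 5.0 |
| | | Yes | 4.3 | 4.5 |
| | Trafo | | 3.7 | 4.2 |
| Chunked AED | - | - | 6.2 | 6.2 |
| | LSTM | No | 5.2 | 5.3 |
| | | Yes | 4.5 | 4.8 |
| | Trafo | | 4.4 | 4.7 |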

Table 6 shows results with LM integration on the LibriSpeech dataset. The chunked AED model used is the best model from Table 2. Interestingly, the WER gap between the global AED and the chunked AED model is reduced when using the LSTM LM with ILM subtraction. Both models improve substantially with LM integration.
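As a rough illustration of what LM integration with ILM subtraction means at the scoring level, the sketch below combines log-probabilities log-linearly. The function name, the scale values, and the exact form of the combination are assumptions for this example and are not taken from the experiments reported here.

```python
from typing import Optional


def combined_score(
    log_p_aed: float,
    log_p_lm: float,
    log_p_ilm: Optional[float] = None,
    lm_scale: float = 0.5,    # hypothetical scale, normally tuned on dev data
    ilm_scale: float = 0.25,  # hypothetical scale, normally tuned on dev data
) -> float:
    """Log-linear combination of AED, external LM and (optionally) ILM scores.

    All arguments are log-probabilities of the same label given the same
    history. When `log_p_ilm` is given, the internal LM estimate is subtracted.
    """
    score = log_p_aed + lm_scale * log_p_lm
    if log_p_ilm is not None:
        score -= ilm_scale * log_p_ilm
    return score


# Without ILM subtraction (the "ILM: No" rows in Table 6).
no_ilm = combined_score(log_p_aed=-1.0, log_p_lm=-2.0)
# With ILM subtraction (the "ILM: Yes" rows in Table 6).
with_ilm = combined_score(log_p_aed=-1.0, log_p_lm=-2.0, log_p_ilm=-4.0)
```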

4.7 Comparison to Transducer

Table 7: WERs [%] for transition towards original transducer, using global encoder, chunked decoder, chunk size 1.

| Model | TED-v2 dev | TED-v2 test | LibriSpeech dev-other | LibriSpeech test-other |
|---|---|---|---|---|
| Baseline with $T'_w = 1$ | 7.5 | 7.3 | 5.8 | 6.0 |
| + EOC masking in $g$ | 7.6 | 7.2 | 5.8 | 6.1 |
| + Remove $c_s$ dep. in $g$ | 7.7 | 7.4 | 6.0 | 6.1 |

We study the transition from a chunked AED model with chunk size 1 to a transducer model [3]. The attention context vector $c_s$ in this variant is the encoder hidden representation $h_t$ at $t = k_s$, because the model attends to a single frame at a time. We first mask out the EOC labels (blank in the transducer) from the input of the decoder LSTM $g$, as the decoder LSTM in the original transducer only operates on non-blank labels. Further, we completely remove the dependency on the encoder output $h$ from the decoder LSTM, just like in the original transducer. Results are shown in Table 7. We see that the additional dependencies seem to be helpful, consistent with [12].
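The EOC masking step can be pictured with a small sketch of the decoder state update. The recurrent step is replaced by a toy function that simply records which labels it has consumed; the `EOC` string, the function names, and the carry-over of the previous state on masked positions are assumptions made for illustration, not the actual model code.

```python
from typing import Callable, List, Sequence, Tuple

EOC = "<eoc>"  # hypothetical name for the end-of-chunk (blank) symbol

State = Tuple[str, ...]


def run_decoder_state(
    labels: Sequence[str],
    step: Callable[[State, str], State],
    init_state: State,
    mask_eoc: bool,
) -> List[State]:
    """Return the decoder state after each position of an alignment.

    With `mask_eoc=True`, EOC labels do not update the state, so the state
    depends only on the non-EOC labels, as in the original transducer.
    """
    state = init_state
    states: List[State] = []
    for label in labels:
        if not (mask_eoc and label == EOC):
            state = step(state, label)
        states.append(state)  # masked positions carry the state over unchanged
    return states


def toy_step(state: State, label: str) -> State:
    """Stand-in for the LSTM step: remember every label that was consumed."""
    return state + (label,)


alignment = ["a", EOC, EOC, "b", EOC]
unmasked = run_decoder_state(alignment, toy_step, (), mask_eoc=False)
masked = run_decoder_state(alignment, toy_step, (), mask_eoc=True)
```

In the unmasked run the final state has seen all five symbols, while in the masked run it has seen only `"a"` and `"b"`. Removing the $c_s$ dependency would correspond to additionally dropping the encoder-derived input from `step`, which this toy step does not model.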

5 Conclusion

In this work, we investigate a streamable chunked attention-based encoder-decoder (AED) model. We show that this model is competitive with the non-streamable global AED model and generalizes very well to long-form speech recognition. All degradations occur only in the chunked encoder: a chunked decoder with a global encoder performs just as well as the global AED model. We study the equivalence to the transducer model and find the extensions to be helpful.

ACKNOWLEDGEMENT

This work was partially supported by NeuroSys, which as part of the initiative “Clusters4Future” is funded by the Federal Ministry of Education and Research BMBF (03ZU1106DA), and by the project RESCALE within the program AI Lighthouse Projects for the Environment, Climate, Nature and Resources funded by the Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection (BMUV), funding ID: 67KI32006A. We thank Wei Zhou, Nick Rossenbach, Zoltán Tüske, Zijian Yang for useful discussions.

References

  • [1] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach, vol. 247, Springer Science & Business Media, 1994.
  • [2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in ICML, New York, NY, USA, 2006, pp. 369–376.
  • [3] A. Graves, “Sequence transduction with recurrent neural networks,” Preprint arXiv:1211.3711, 2012.
  • [4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in NIPS, 2015, pp. 577–585.
  • [5] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in ICLR, 2018.
  • [6] R. Hsiao, D. Can, T. Ng, R. Travadi, and A. Ghoshal, “Online automatic speech recognition with listen, attend and spell model,” IEEE Signal Processing Letters, vol. 27, 2020.
  • [7] E. Tsunoo, Y. Kashiwagi, and S. Watanabe, “Streaming transformer ASR with blockwise synchronous beam search,” in SLT, 2021, pp. 22–29.
  • [8] A. Zeyer, R. Schmitt, W. Zhou, R. Schlüter, and H. Ney, “Monotonic segmental attention for automatic speech recognition,” in IEEE Spoken Language Technology Workshop, Doha, Qatar, Jan. 2023, pp. 229–236.
  • [9] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-end speech recognition: A survey,” Preprint arXiv:2303.03329, 2023.
  • [10] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman, “Recognizing long-form speech using streaming end-to-end models,” in ASRU, 2019.
  • [11] C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan, R. Prabhavalkar, Z. Chen, T. Sainath, and Y. Wu, “A comparison of end-to-end models for long-form speech recognition,” in ASRU, 2019, pp. 889–896.
  • [12] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A new training pipeline for an improved neural transducer,” in Interspeech, Shanghai, China, Oct. 2020, pp. 2812–2816.
  • [13] A. Zeyer, R. Schlüter, and H. Ney, “A study of latent monotonic attention variants,” Preprint arXiv:2103.16710, 2021.
  • [14] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in NIPS, 2016, vol. 29.
  • [15] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, “Improving the performance of online neural transducer models,” in ICASSP, 2018.
  • [16] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous transformers for end-to-end speech recognition,” in ICASSP, May 2020, pp. 7884–7888.
  • [17] P. Wilken, T. Alkhouli, E. Matusov, and P. Golik, “Neural simultaneous speech translation using alignment-based chunking,” in SLT, Online, July 2020, pp. 237–246.
  • [18] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023, pp. 28492–28518.
  • [19] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” Preprint arXiv:2303.01037, 2023.
  • [20] A. Zeyer, R. Schlüter, and H. Ney, “Towards online-recognition with deep bidirectional LSTM acoustic models,” in Interspeech, San Francisco, CA, USA, Sept. 2016.
  • [21] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in ICASSP, 2019.
  • [22] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Transformer ASR with contextual block processing,” in ASRU, 2019, pp. 427–433.
  • [23] B. Zhang, D. Wu, Z. Yao, X. Wang, F. Yu, C. Yang, L. Guo, Y. Hu, L. Xie, and X. Lei, “Unified streaming and non-streaming two-pass end-to-end model for speech recognition,” Preprint arXiv:2012.05481, 2020.
  • [24] Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer, “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP, 2021, pp. 6783–6787.
  • [25] X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” in ICASSP, 2021, pp. 5904–5908.
  • [26] K. An, H. Zheng, Z. Ou, H. Xiang, K. Ding, and G. Wan, “CUSIDE: Chunking, simulating future context and decoding for streaming ASR,” in Interspeech, 2022, pp. 2103–2107.
  • [27] F. Weninger, M. Gaudesi, M. A. Haidar, N. Ferri, J. Andrés-Ferrer, and P. Zhan, “Conformer with dual-mode chunked attention for joint online and offline ASR,” in Interspeech, 2022.
  • [28] P. Swietojanski, S. Braun, D. Can, T. F. Da Silva, A. Ghoshal, T. Hori, R. Hsiao, H. Mason, E. McDermott, H. Silovsky, R. Travadi, and X. Zhuang, “Variable attention masking for configurable transformer transducer speech recognition,” in ICASSP, 2023, pp. 1–5.
  • [29] H. Gulzar, M. R. Busto, T. Eda, K. Itoyama, and K. Nakadai, “miniStreamer: Enhancing small conformer with chunked-context masking for streaming ASR applications on the edge,” in Interspeech, 2023, pp. 3277–3281.
  • [30] C. Wang, Y. Wu, L. Lu, S. Liu, J. Li, G. Ye, and M. Zhou, “Low latency end-to-end streaming speech recognition with a scout network,” in Interspeech, 2020, pp. 2112–2116.
  • [31] N. Moritz, T. Hori, and J. L. Roux, “Dual causal/non-causal self-attention for streaming end-to-end speech recognition,” in Interspeech, 2021, pp. 1822–1826.
  • [32] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [33] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016, pp. 4960–4964.
  • [34] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019, pp. 2613–2617.
  • [35] Z. Tüske, G. Saon, K. Audhkhasi, and B. Kingsbury, “Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard,” in Interspeech, Oct. 2020.
  • [36] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, Oct. 2020, pp. 5036–5040.
  • [37] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [38] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in ICLR, 2017.
  • [39] W. Zhou, A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “Equivalence of segmental and neural transducer modeling: A proof of concept,” in Interspeech, Aug. 2021, pp. 2891–2895.
  • [40] G. Saon, Z. Tüske, and K. Audhkhasi, “Alignment-length synchronous decoding for RNN transducer,” in ICASSP, 2020.
  • [41] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Investigating methods to improve language model integration for attention-based encoder-decoder ASR models,” in Interspeech, Aug. 2021, pp. 2856–2860.
  • [42] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech transducer model with internal language model prior correction,” in Interspeech, Aug. 2021, pp. 2052–2056.
  • [43] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
  • [44] A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014, pp. 3935–3939.
  • [45] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in ACL, 2016.
  • [46] A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in ACL, Melbourne, Australia, 2018.
  • [47] TensorFlow development team, “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” Preprint arXiv:1603.04467, 2016.
  • [48] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017.
  • [49] W. Zhou, R. Schlüter, and H. Ney, “Robust beam search for encoder-decoder attention based speech recognition without length bias,” in Interspeech, 2020, pp. 1768–1772.
  • [50] K. Murray and D. Chiang, “Correcting length bias in neural machine translation,” in WMT, 2018.