Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

Abstract

We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.

Index Terms— Chunked attention models, transducer, streamable

1 Introduction & Related Work

Established streaming models include the traditional HMM [1], CTC [2], and more recently the transducer [3]. While many streamable attention-based encoder-decoder (AED) models have been proposed [4, 5, 6, 7, 8], they tend to be complicated, rely on many heuristics, and are less robust than the transducer [9].

Here we show how a seemingly very simple modification makes the AED model streamable while remaining robust and competitive, specifically on long-form speech, in contrast to many other AED and transducer models [10, 11, 12, 8, 9]. Interestingly, this small modification leads to an equivalence to transducer models, and we study the exact modeling differences.

We use chunking as the core mechanism for both the encoder and the cross-attention in the decoder. This means that we extract chunks (windows) of fixed width with a fixed step size (stride). The static step size implies that we have a variable number of labels per chunk. The static sizes in the encoder also allow for efficient processing in training and recognition, which is more efficient than causal self-attention and also performs better.

Related to chunkwise processing is the operation on segments with variable boundaries in segmental attention models [8], or on fixed-size windows at variable positions [13]. Variable positions or segment boundaries allow the use of a single label per window or segment. In contrast, using fixed-size chunks at fixed positions implies that we have a variable number of labels per chunk. Further, we can use the same chunking in the encoder, with the big advantage that we can parallelize the training computation in the encoder independently of the alignment.

Similar chunking in the decoder has been done in [14, 15, 16, 17, 18, 19] and similar chunking in the encoder has been done in [20, 21, 22, 23, 7, 24, 25, 26, 27, 28, 29]. There are also other approaches to make self-attention in the encoder streamable [30, 31, 9].

2 Global AED Model

Our baseline is the standard global attention-based encoder-decoder (AED) model [32] adapted for speech recognition [4, 33, 34, 35]. We use a Conformer-based encoder [36]. The model operates on a sequence of audio feature frames $x_{1:T}\in\mathbb{R}^{T\times D}$ (10ms resolution) of length $T$ as input and encodes it as a sequence

$$h_{1:T^{\prime}}=\operatorname{GlobalEncoder}(x_{1:T})\in\mathbb{R}^{T^{\prime}\times D_{\textrm{enc}}}$$

of length $T^{\prime}$ and encoder feature dimension $D_{\textrm{enc}}$ . The encoder has a convolutional frontend with striding in time which downsamples the input by a factor of 6. Thus, the encoder outputs a frame every 60ms and $T^{\prime}=\lceil\frac{T}{6}\rceil$ .

The probability of the output label sequence $a_{1:S}\in\mathcal{A}^{S}$ given the encoder output sequence $h_{1:T^{\prime}}$ is defined as

$$p(a_{1:S}\mid h_{1:T^{\prime}})=\prod_{s=1}^{S}p(a_{s}\mid a_{1:s-1},h_{1:T^{\prime}}).$$

We have $a_{S}=\mathrm{EOS}$ to mark the end of the sequence (EOS), which implicitly models the probability of the sequence length. This part of the model is called the decoder. The decoder uses global attention on $h_{1:T^{\prime}}$ per output step $s$. The sole difference between the global decoder and the chunked decoder is global attention vs. chunked attention. The decoder is defined in detail in Section 3.2.

3 Chunked AED Model

Fig. 1: Chunking on input frames $x_{1:T}$ with chunk center size $T_{w}$, right context $T_{r}$ and stride $T_{s}$, where we have $T_{s}=T_{w}$.

As visualized in Fig. 1, we extract strided windows called ‘chunks’ with chunk size $T_{w}$ and stride $T_{s}$ . For input $x_{1:T}=(x_{1},\dots,x_{T})\in\mathbb{R}^{T\times D}$ , we get the chunks $x^{\prime}_{1:K,1:T_{w}}\in\mathbb{R}^{K\times T_{w}\times D}$ with $x^{\prime}_{k,1:T_{w}}\in\mathbb{R}^{T_{w}\times D}$ for chunk index $k\in\{1,\dots,K\}$ with $K=\lceil\frac{T}{T_{s}}\rceil$ , where

$$x^{\prime}_{k,t}=x_{(k-1)\cdot T_{s}+t}\in\mathbb{R}^{D},\quad t\in\{1,\dots,T_{w}\}.$$

Additionally, we might extend the chunk size by $T_{r}$ more frames to get some extended right context.
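The chunk extraction defined above can be illustrated with a minimal NumPy sketch. This is not the RETURNN implementation: the function name `extract_chunks` and the zero-padding of the final chunk are assumptions made for the illustration.

```python
import math

import numpy as np


def extract_chunks(x, chunk_size, stride, right_context=0):
    """Cut x of shape (T, D) into K = ceil(T / stride) chunks.

    Returns an array of shape (K, chunk_size + right_context, D).
    """
    num_frames, dim = x.shape
    num_chunks = math.ceil(num_frames / stride)
    width = chunk_size + right_context
    # Assumption: zero-pad at the end so that the last chunk is complete.
    padded_len = max(num_frames, (num_chunks - 1) * stride + width)
    padded = np.zeros((padded_len, dim), dtype=x.dtype)
    padded[:num_frames] = x
    # 0-based version of x'_{k,t} = x_{(k-1) * T_s + t}.
    return np.stack(
        [padded[k * stride : k * stride + width] for k in range(num_chunks)]
    )


x = np.arange(10, dtype=np.float32).reshape(10, 1)  # T = 10 frames, D = 1
chunks = extract_chunks(x, chunk_size=4, stride=4, right_context=2)
print(chunks.shape)  # (K, T_w + T_r, D)
```

With $T_{s}=T_{w}$ the chunk centers tile the input without overlap, and only the $T_{r}$ right-context frames are duplicated between neighboring chunks.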

For the streaming model, the chunking is applied directly on the input (e.g. log mel features every 10ms), and then a variant of the Conformer encoder works on the chunks $x^{\prime}_{1:K,1:T_{w}}$ and calculates the encoder output

$$h^{\prime}_{1:K,1:T^{\prime}_{w}}=\operatorname{ChunkedEncoder}(x^{\prime}_{1:K,1:T_{w}})\in\mathbb{R}^{K\times T^{\prime}_{w}\times D_{\textrm{enc}}}$$

where $T^{\prime}_{w}=\lceil\frac{T_{w}}{6}\rceil$ .

For comparison, we also use a standard Conformer with global attention applied on the whole input

$$h_{1:T^{\prime}}=\operatorname{GlobalEncoder}(x_{1:T})\in\mathbb{R}^{T^{\prime}\times D_{\textrm{enc}}}$$

and apply chunking on the encoder output $h_{1:T^{\prime}}$ such that we get the chunked encoder output $h^{\prime}_{1:K,1:T^{\prime}_{w}}$ .

3.1 Streamable Chunked Encoder

Our starting point is the standard Conformer, operating on chunks instead of the whole sequence, i.e. operating on $x^{\prime}_{k,1:T_{w}}$ for every chunk index $k$. The self-attention is calculated per chunk, i.e. for both the chunk center and the right context frames, and attends to all frames within the chunk and additionally to the previous chunk, as can be seen in Fig. 2. Thus it is non-causal within the chunk, just like the convolution. The decoder cross-attention afterwards only accesses the chunk center frames, so we expect the chunk center to cover the labels for this chunk. The future lookahead via the right context frames does not accumulate over multiple layers. In contrast, the history context, which we obtain by accessing the previous chunk, does accumulate over multiple layers. This also explains why we do not need any additional left context frames within the chunk.

In training, we can calculate all chunks in parallel, and the self-attention calculation per chunk is more efficient compared to the global self-attention. We only get a small overhead due to the overlap of the chunk via the right context frames.
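As a simplified illustration of the attention pattern (own chunk plus previous chunk), the following sketch builds a boolean mask over a flat frame sequence. It deliberately ignores the right-context frames, and it expresses the pattern as a mask over the whole sequence rather than as per-chunk batched computation; the helper name `chunked_attention_mask` is hypothetical.

```python
import numpy as np


def chunked_attention_mask(num_frames, chunk_size):
    """Boolean (num_frames, num_frames) mask.

    Entry [q, k] is True if query frame q may attend to key frame k,
    i.e. if k lies in the same chunk as q or in the previous chunk.
    """
    chunk_idx = np.arange(num_frames) // chunk_size
    diff = chunk_idx[:, None] - chunk_idx[None, :]  # query chunk minus key chunk
    return (diff == 0) | (diff == 1)


mask = chunked_attention_mask(num_frames=6, chunk_size=2)
print(mask.astype(int))
```

Applying such a mask in every layer lets information from earlier chunks propagate one chunk further per layer, which is the accumulation of history context described above, while no frame ever sees a later chunk.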

Note that this is mathematically equivalent to the way look-ahead context leakage is avoided in the Emformer [24] and in dual causal/non-causal self-attention [31].

3.2 Streamable Chunked Decoder

In the output vocabulary $\mathcal{A}$ , we replace the $\mathrm{EOS}$ by a new special end-of-chunk (EOC) symbol $\mathrm{EOC}$ . We start with the first chunk ( $k=1$ ), and once we get $\mathrm{EOC}$ , we advance to the next chunk ( $k^{\prime}=k+1$ ). The decoder is exactly like in the global AED model, except that the global attention is replaced by attention on the current chunk. The possible transitions can be seen in Fig. 3.

The probability of emitting the next label $a_{s}\in\mathcal{A}$ is estimated using an LSTM [37] with zoneout [38] and MLP cross-attention [32] to the current chunk of the encoder:

$$\begin{aligned}
p(a_{s}\mid...)&=(\operatorname{softmax}\circ\operatorname{Linear}\circ\operatorname{maxout}\circ\operatorname{Linear})\big(g_{s},c_{s}\big)\\
g_{s}&=\operatorname{ZoneoutLSTM}(c_{1:s-1},a_{1:s-1})\\
c_{s}&=\sum_{t=1}^{T^{\prime}_{w}}\alpha_{s,t}\cdot h^{\prime}_{k_{s},t}\in\mathbb{R}^{D_{\textrm{enc}}}\\
\alpha_{s,t}&=\frac{\exp(e_{s,t})}{\sum_{\tau=1}^{T^{\prime}_{w}}\exp(e_{s,\tau})}\in\mathbb{R},\quad t\in\{1,\dots,T^{\prime}_{w}\}\\
e_{s,t}&=(\operatorname{Linear}\circ\tanh\circ\operatorname{Linear})\big(g_{s},h^{\prime}_{k_{s},t}\big)\in\mathbb{R},
\end{aligned}$$

and the current chunk index $k_{s}$ is defined as

$$k_{s}=\begin{cases}k_{s-1}+1,&a_{s-1}=\mathrm{EOC}\\ k_{s-1},&a_{s-1}\neq\mathrm{EOC}\end{cases}$$

and initially $k_{1}=1$. The sequence is ended when we reach $k_{s}=K$ and $a_{s}=\mathrm{EOC}$. The attention weights here are only calculated inside the current chunk. Further, we do not use attention weight feedback. Otherwise the model is exactly the same as the global attention decoder, to allow for direct comparisons, and also to import model parameters.
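The chunk index recursion and the termination condition can be traced on a toy label sequence. The sketch below is illustrative only; the string `"<eoc>"` and the function name `chunk_indices` are placeholders chosen here.

```python
EOC = "<eoc>"  # placeholder string for the end-of-chunk symbol


def chunk_indices(labels, num_chunks):
    """Return the 1-based chunk index k_s for every position s,
    and whether the label sequence is complete."""
    indices = []
    k = 1  # k_1 = 1
    for label in labels:
        indices.append(k)
        if label == EOC:
            k += 1  # the next position attends to the next chunk
    finished = bool(labels) and labels[-1] == EOC and indices[-1] == num_chunks
    return indices, finished


labels = ["he", "llo", EOC, EOC, "wor", "ld", EOC]
indices, finished = chunk_indices(labels, num_chunks=3)
print(indices, finished)
```

In this example the second chunk contains no labels, so it contributes only a single $\mathrm{EOC}$, and the sequence is complete once $\mathrm{EOC}$ is emitted in chunk $K=3$.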

We observe that the chunked decoder is equivalent to a transducer model [3, 12], where $\mathrm{EOC}$ behaves exactly like the blank symbol, and we iterate over chunks instead of frames, which is like a higher downsampling rate. A similar observation for a similar model has been made in [16]. The main difference is the cross-attention and the dependence of the decoder LSTM on the encoder output. Note that this is a different kind of equivalence compared to [39], where a segmental model is rewritten in a framewise manner.

3.3 Training

We create a chunkwise alignment from an existing framewise alignment, then add the EOC labels, and train with labelwise cross-entropy, just like the standard AED training. This is different to the standard transducer training, which performs a full sum over all alignment paths. The standard transducer training criterion cannot be applied easily here due to the alignment label dependencies [12, 9].
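The conversion from a framewise to a chunkwise alignment can be sketched as follows. The sketch assumes that the framewise alignment holds at most one label per encoder frame with a blank placeholder elsewhere (no label repetitions); the symbol strings and the function name `chunkwise_targets` are hypothetical.

```python
EOC = "<eoc>"      # placeholder for the end-of-chunk symbol
BLANK = "<blank>"  # placeholder for frames without a label


def chunkwise_targets(frame_alignment, chunk_size):
    """Collect the labels of each chunk in order and close every chunk with EOC."""
    targets = []
    for start in range(0, len(frame_alignment), chunk_size):
        chunk = frame_alignment[start : start + chunk_size]
        targets.extend(label for label in chunk if label != BLANK)
        targets.append(EOC)
    return targets


alignment = [BLANK, "he", "llo", BLANK, BLANK, BLANK, "wor", BLANK, "ld"]
targets = chunkwise_targets(alignment, chunk_size=3)
print(targets)
```

The resulting sequence is a plain label sequence over $\mathcal{A}$ including $\mathrm{EOC}$, so it can be used directly as the target for labelwise cross-entropy training.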

3.4 Beam Search

We perform alignment-synchronous search, meaning that in each step, all hypotheses have the same number of labels, including $\mathrm{EOC}$ . It is exactly the same as the alignment-synchronous transducer search [40, 12].

For the very best results, we make use of an external language model (LM) and perform internal language model (ILM) prior correction [41]. Note that the chunked AED model has the EOC label (blank label) instead of the EOS label. We use the scores

$$P(a_{s}|...)=\begin{cases}
P_{\textrm{AED}}^{\alpha}(a_{s}|...)\cdot P^{\beta}_{\textrm{LM}}(a_{s}|...)\cdot P^{-\lambda}_{\textrm{ILM}}(a_{s}|...),&a_{s}\neq\mathrm{EOC}\\
P_{\textrm{AED}}(\mathrm{EOC}|...),&a_{s}=\mathrm{EOC},\ k<K\\
P_{\textrm{AED}}^{\alpha}(\mathrm{EOC}|...)\cdot P^{\beta}_{\textrm{LM}}(\mathrm{EOS}|...)\cdot P^{-\lambda}_{\textrm{ILM}}(\mathrm{EOS}|...),&a_{s}=\mathrm{EOC},\ k=K
\end{cases}$$

where $\alpha$, $\beta$, and $\lambda$ are tuned scales and we set $\alpha=1-\beta$. $P_{\textrm{LM}}(\mathrm{EOS}|...)$ and $P_{\textrm{ILM}}(\mathrm{EOS}|...)$ are set to 0 for $k<K$, and the distributions are then renormalized. This is very similar to the EOS handling for transducer models with ILM prior correction [42], except that our ILM also has EOS and that we renormalize. We use the Mini-LSTM ILM method [41], trained on the training transcription labels with $\mathrm{EOS}$, the same as in the LM training data.
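In log space, the three cases of the score combination and the EOS masking can be sketched as below. This is a toy illustration with dictionaries in place of model outputs; all names (`combined_log_score`, `mask_eos_and_renormalize`, the symbol strings) are assumptions, and the caller is expected to pass LM and ILM log-probabilities that were already renormalized for $k<K$.

```python
import math

EOC = "<eoc>"  # placeholder symbol names
EOS = "<eos>"


def mask_eos_and_renormalize(probs):
    """Set P(EOS) to 0 and renormalize the remaining probabilities (used for k < K)."""
    rest = {label: p for label, p in probs.items() if label != EOS}
    total = sum(rest.values())
    renorm = {label: p / total for label, p in rest.items()}
    renorm[EOS] = 0.0
    return renorm


def combined_log_score(label, log_aed, log_lm, log_ilm, beta, lam, last_chunk):
    """Log of the combined score; alpha is tied to beta via alpha = 1 - beta."""
    alpha = 1.0 - beta
    if label != EOC:
        return alpha * log_aed[label] + beta * log_lm[label] - lam * log_ilm[label]
    if not last_chunk:
        return log_aed[EOC]  # k < K: only the AED score is used for EOC
    # k = K: EOC of the AED model is paired with EOS of the LM and ILM.
    return alpha * log_aed[EOC] + beta * log_lm[EOS] - lam * log_ilm[EOS]


log_aed = {"a": math.log(0.5), EOC: math.log(0.5)}
log_lm = {"a": math.log(0.25), EOS: math.log(0.5)}
log_ilm = {"a": math.log(0.5), EOS: math.log(0.25)}
score_label = combined_log_score("a", log_aed, log_lm, log_ilm, 0.5, 0.5, False)
score_eoc_mid = combined_log_score(EOC, log_aed, log_lm, log_ilm, 0.5, 0.5, False)
score_eoc_last = combined_log_score(EOC, log_aed, log_lm, log_ilm, 0.5, 0.5, True)
masked = mask_eos_and_renormalize({"a": 0.25, "b": 0.25, EOS: 0.5})
```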

4 Experiments

We conduct experiments on LibriSpeech 960h [43] and TED-LIUM-v2 200h [44] using BPE labels [45]. We use RETURNN [46] based on TensorFlow [47]. All code, including full recipes, is available online (https://github.com/rwth-i6/returnn-experiments/tree/master/2023-chunked-aed).

We train the global AED model for 100 epochs using a single consumer GPU. We apply on-the-fly speed perturbation and SpecAugment [34]. The encoder consists of 12 Conformer layers with a model dimension of 512, and the decoder LSTM has a dimension of 1024. We use an auxiliary CTC loss [48] on top of the encoder output for better training convergence and for the alignments.

To train the chunked AED models, we extract time-synchronous alignments from the jointly trained CTC model with label loops disallowed. Then, we convert this alignment into a chunk-synchronous one and use it as the target for cross-entropy training. We initialize all parameters using the best checkpoint of the global AED model and train for 15-30 epochs.

4.1 Chunked Decoder

Table 1: WERs [%], studying chunked decoder with different chunk sizes with no overlap when using global encoder. $\infty$ means global decoder. The frame rate of $h$ is 60 ms.

Chunk size		TED-v2		LibriSpeech
$T^{\prime}_{w}$	Sec.	dev	test	dev-other	test-other
$1$	$0.06$	7.5	7.3	5.8	6.0
$5$	$0.3$	7.3	7.1	5.7	5.9
$10$	$0.6$	7.3	6.9	5.7	5.7
$25$	$1.5$	7.4	6.9	5.6	5.7
$\infty$		7.4	6.9	5.6	5.7

First, we investigate the effect of chunking only in the decoder, i.e. chunking the output $h$ of the global encoder. Results on TED-LIUM-v2 and LibriSpeech are shown in Table 1. We observe that the chunked decoder matches the WERs of the global AED model even with small chunk sizes, e.g. with $T^{\prime}_{w}=10$ (0.6 seconds) on both test sets.

4.2 Chunked Encoder-Decoder

Table 2: For chunked AED, effect on WERs [%] of carry-over history context, center chunk size $T_{w}$, and lookahead future context $T_{r}$. All sizes are in seconds.

Carry-over	Chunk size	Lookahead	TED-v2		LibriSpeech
			dev	test	dev-o.	test-o.
2.4	0.6	0.3	8.2	7.6	7.2	7.4
		0.6	7.9	7.4	6.8	6.8
		0.9	7.7	7.1	6.6	6.7
0	1.2	0.3	8.6	8.0	7.1	7.0
1.2			7.9	7.3	6.8	6.8
2.4			7.7	7.3	6.7	6.7
3.6			7.7	7.3	6.7	6.7
2.4	1.2	0	10.2	9.7	7.8	7.8
		0.6	7.8	7.2	6.5	6.6
		0.9	7.5	7.1	6.2	6.3
3.0	1.5	0.3	7.7	7.3	6.3	6.3
3.6	1.8	0.3	7.5	7.1	6.2	6.2
$\infty$			7.4	6.9	5.6	5.7

Table 2 shows WER results of the chunked AED model. We observe that carrying over left context yields an improvement, where 2.4 seconds is enough. In addition, using future lookahead gives good improvements in all cases. The chunked AED model with a total chunk size and lookahead of 2.1 seconds (chunk size 1.8, lookahead 0.3) achieves a WER of $7.1\%$ and $6.2\%$ on the TED-LIUM-v2 and LibriSpeech test sets respectively, a relative increase in WER of about $3\%$ and $9\%$ compared to the global AED model.

4.3 Latency

Table 3: Word emit latency for chunked AED model on TED-LIUM-v2 dev dataset. All timing values are in seconds.

Carry-over	Chunk size	Lookahead	Latency			WER [%]
			$50^{\textrm{th}}$ perc.	$95^{\textrm{th}}$ perc.	$99^{\textrm{th}}$ perc.	dev
2.4	0.6	0.9	1.08	1.39	1.44	7.7
	1.2	0.3	0.78	1.34	1.42	7.7
		0.6	1.08	1.63	1.71	7.8
		0.9	1.39	1.94	2.02	7.5
3.6	1.8	0.3	1.11	1.90	2.01	7.5

We compute the difference between the word end time from a GMM alignment and the chunk end time in which the word is emitted by the chunked AED model. Word emit latency measures can be found in Table 3. Lookahead seems to add more latency compared to using larger chunk sizes (rows: 1 vs 2, 4 vs 5).
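The latency measure can be sketched as follows. The sketch assumes that the chunk end time includes the lookahead frames and that chunk indices are 0-based; the function name and the toy word end times are made up for the illustration.

```python
import numpy as np


def word_emit_latencies(word_end_times, emit_chunk_indices, chunk_size, lookahead):
    """Latency per word: end time of the emitting chunk minus the word end time.

    Assumption: a chunk with 0-based index k ends at (k + 1) * chunk_size + lookahead.
    """
    word_end_times = np.asarray(word_end_times, dtype=np.float64)
    chunk_ends = (np.asarray(emit_chunk_indices) + 1) * chunk_size + lookahead
    return chunk_ends - word_end_times


# Toy example: three words, chunk size 1.2 s, lookahead 0.3 s.
latencies = word_emit_latencies(
    word_end_times=[0.5, 1.0, 2.0],
    emit_chunk_indices=[0, 0, 1],
    chunk_size=1.2,
    lookahead=0.3,
)
median, p95, p99 = np.percentile(latencies, [50, 95, 99])
```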

4.4 Long-Form Recognition

Table 4: WERs [%] of long-form speech recognition on TED-LIUM-v2 test dataset with $\mathcal{C}$ concatenated sequences.

$\mathcal{C}$	Sequence lengths (sec)		Global Enc.		Ch. Enc.
	Mean $\pm$ Std	Min - Max	Gl. Dec.	Chunk Dec.	
$1$	8.20 $\pm$ 4.30	0.35 - 32.55	$6.9$	$6.9$	$7.3$
$2$	23.10 $\pm$ 8.50	0.41 - 45.70	$7.0$	$6.9$	$7.1$
$4$	33.70 $\pm$ 11.90	0.41 - 70.70	$9.2$	$7.0$	$7.0$
$8$	65.95 $\pm$ 22.19	7.19 - 116.99	$23.4$	$7.1$	$7.1$
$10$	82.51 $\pm$ 26.87	15.67 - 142.08	$34.2$	$7.1$	$7.0$
$20$	160.14 $\pm$ 53.98	17.83 - 237.27	$62.4$	$7.1$	$7.0$

We investigate the generalization to long-form speech recognition. We conduct these experiments on TED-LIUM-v2 by concatenating $\mathcal{C}$ consecutive sequences from the same recording to create much longer sequences than those seen in training. We compare the global AED baseline to a chunked AED model with left context carry-over $2.4$ sec, chunk size $1.2$ sec, and lookahead $0.3$ sec, and to a chunked decoder with a global encoder. From the results in Table 4, we observe that the global AED model becomes much worse on longer sequences, whereas the chunked AED model generalizes very well and even improves, which is probably because the decoder now has better LM context. This is also the case when only the decoder is chunked. The relative positional encoding in the encoder is probably helpful. The generalization is much better than that of other variants such as the segmental AED model [8], although that work uses an LSTM-based encoder.
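The construction of the long-form test sets can be sketched as a grouping of consecutive segments per recording. The sketch assumes that the segment list is already sorted by recording and time, and that a shorter remainder group is kept at the end of each recording; both are assumptions of this illustration, as are the names used.

```python
from itertools import groupby


def group_consecutive(segments, num_concat):
    """Group (recording, segment_name) pairs into runs of up to num_concat
    consecutive segments that all come from the same recording."""
    groups = []
    for recording, items in groupby(segments, key=lambda seg: seg[0]):
        names = [name for _, name in items]
        for start in range(0, len(names), num_concat):
            groups.append((recording, names[start : start + num_concat]))
    return groups


segments = [
    ("talk1", "s1"), ("talk1", "s2"), ("talk1", "s3"),
    ("talk2", "s1"), ("talk2", "s2"),
]
groups = group_consecutive(segments, num_concat=2)
print(groups)
```

The audio and the reference transcriptions of each group would then be concatenated in order to form one long sequence.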

4.5 Beam Size and Length Normalization

Table 5: Comparison of effect of beam sizes and length normalization between global AED and chunked AED models. WERs [%] on TED-LIUM-v2 test dataset.

Length Norm.	Beam	Global	Chunked
(No influence)	$1$	$7.1$	$7.4$
Yes	$12$	$6.9$	$7.3$
	$32$	$6.9$	$7.3$
	$64$	$6.9$	$7.3$
No	$12$	$7.0$	$7.3$
	$32$	$8.5$	$7.3$
	$64$	$10.9$	$7.3$

The global AED model suffers from the length bias problem [49] because there is no explicit length modeling, which pushes the model to prefer short hypotheses, especially when increasing the beam size. However, the chunked AED model, like the transducer, does not have the length bias issue since the length is modeled by the EOC symbol. To verify this, we run experiments with different beam sizes and optional length normalization [50] for both the global AED and the chunked AED model on the TED-LIUM-v2 test dataset. Results are shown in Table 5. The global AED model degrades a lot as we increase the beam size and disable length normalization, whereas the chunked AED model does not need such a heuristic and its performance remains consistent. Additionally, both models perform marginally worse with greedy recognition.
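The effect of length normalization on hypothesis selection can be seen in a toy example. Dividing the log-probability by the number of labels is used here as one common form of length normalization; it is an assumption of this sketch and not necessarily the exact variant of [50], and the scores are made up.

```python
def best_hypothesis(hyps, length_norm):
    """hyps: list of (labels, log_prob) pairs. Returns the labels of the best one."""

    def score(hyp):
        labels, log_prob = hyp
        # Assumption: normalize by the number of labels in the hypothesis.
        return log_prob / len(labels) if length_norm else log_prob

    return max(hyps, key=score)[0]


hyps = [
    (["a", "<eos>"], -2.0),            # short hypothesis
    (["a", "b", "c", "<eos>"], -3.0),  # longer hypothesis, lower total log-prob
]
without_norm = best_hypothesis(hyps, length_norm=False)
with_norm = best_hypothesis(hyps, length_norm=True)
```

Without normalization the short hypothesis wins because every additional label can only lower the total log-probability; with normalization the per-label score of the longer hypothesis is higher.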

4.6 External Language Model

Table 6: WERs [%] with Transformer and LSTM language model integration on LibriSpeech dataset.

Model	LM	ILM	dev-other	test-other
Global AED	-	-	5.6	5.7
	LSTM	No	4.6	5.0
	LSTM	Yes	4.3	4.5
	Trafo	Yes	3.7	4.2
Chunked AED	-	-	6.2	6.2
	LSTM	No	5.2	5.3
	LSTM	Yes	4.5	4.8
	Trafo	Yes	4.4	4.7

Table 6 shows results with LM integration on the LibriSpeech dataset. The chunked AED model used is the best model from Table 2. Interestingly, the WER gap between the global AED and the chunked AED model is reduced when using the LSTM LM and ILM subtraction. Both models improve substantially with LM integration.

4.7 Comparison to Transducer

Table 7: WERs [%] for transition towards original transducer, using global encoder, chunked decoder, chunk size 1.

Model	TED-v2		LibriSpeech
Model	dev	test	dev-other	test-other
Baseline with $T^{\prime}_{w}=1$	7.5	7.3	5.8	6.0
+ EOC masking in $g$	7.6	7.2	5.8	6.1
+ Remove $c_{s}$ dep. in $g$	7.7	7.4	6.0	6.1

We study the transition from a chunked AED model with chunk size 1 into a transducer model [3]. The attention context vector $c_{s}$ in this variant is the encoder hidden representation $h_{t}$ at $t=k_{s}$ because the model attends to a single frame at a time. We first mask out the $\mathrm{EOC}$ labels (blank in the transducer) from the decoder LSTM $g$, as the decoder LSTM in the original transducer only operates on non-blank labels. Further, we completely remove the dependency on the encoder output $h$ from the decoder LSTM, just like in the original transducer. Results are shown in Table 7. We see that the additional dependencies seem to be helpful, consistent with [12].

5 Conclusion

In this work, we investigate a streamable chunked attention-based encoder-decoder (AED) model. We show that this model is competitive with the non-streamable global AED model and generalizes very well on long-form speech recognition. All degradations occur only in the chunked encoder: a chunked decoder with a global encoder performs just as well as the global AED model. We study the equivalence to the transducer model and find the extensions to be helpful.

ACKNOWLEDGEMENT

This work was partially supported by NeuroSys, which as part of the initiative “Clusters4Future” is funded by the Federal Ministry of Education and Research BMBF (03ZU1106DA), and by the project RESCALE within the program AI Lighthouse Projects for the Environment, Climate, Nature and Resources funded by the Federal Ministry for the Environment, Nature Conservation, Nuclear Safety and Consumer Protection (BMUV), funding ID: 67KI32006A. We thank Wei Zhou, Nick Rossenbach, Zoltán Tüske, Zijian Yang for useful discussions.

References

[1] H. A. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach, vol. 247, Springer Science & Business Media, 1994.
[2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in ICML, New York, NY, USA, 2006, pp. 369–376.
[3] A. Graves, “Sequence transduction with recurrent neural networks,” Preprint arXiv:1211.3711, 2012.
[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in NIPS, 2015, pp. 577–585.
[5] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in ICLR, 2018.
[6] R. Hsiao, D. Can, T. Ng, R. Travadi, and A. Ghoshal, “Online automatic speech recognition with listen, attend and spell model,” IEEE Signal Processing Letters, vol. 27, 2020.
[7] E. Tsunoo, Y. Kashiwagi, and S. Watanabe, “Streaming transformer ASR with blockwise synchronous beam search,” in SLT, 2021, pp. 22–29.
[8] A. Zeyer, R. Schmitt, W. Zhou, R. Schlüter, and H. Ney, “Monotonic segmental attention for automatic speech recognition,” in IEEE Spoken Language Technology Workshop, Doha, Qatar, Jan. 2023, pp. 229–236.
[9] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, and S. Watanabe, “End-to-end speech recognition: A survey,” Preprint arXiv:2303.03329, 2023.
[10] A. Narayanan, R. Prabhavalkar, C.-C. Chiu, D. Rybach, T. N. Sainath, and T. Strohman, “Recognizing long-form speech using streaming end-to-end models,” in ASRU, 2019.
[11] C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan, R. Prabhavalkar, Z. Chen, T. Sainath, and Y. Wu, “A comparison of end-to-end models for long-form speech recognition,” in ASRU, 2019, pp. 889–896.
[12] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A new training pipeline for an improved neural transducer,” in Interspeech, Shanghai, China, Oct. 2020, pp. 2812–2816.
[13] A. Zeyer, R. Schlüter, and H. Ney, “A study of latent monotonic attention variants,” Preprint arXiv:2103.16710, 2021.
[14] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in NIPS, 2016, vol. 29.
[15] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, “Improving the performance of online neural transducer models,” in ICASSP, 2018.
[16] Z. Tian, J. Yi, Y. Bai, J. Tao, S. Zhang, and Z. Wen, “Synchronous transformers for end-to-end speech recognition,” in ICASSP, May 2020, pp. 7884–7888.
[17] P. Wilken, T. Alkhouli, E. Matusov, and P. Golik, “Neural simultaneous speech translation using alignment-based chunking,” in SLT, Online, July 2020, pp. 237–246.
[18] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023, pp. 28492–28518.
[19] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, et al., “Google usm: Scaling automatic speech recognition beyond 100 languages,” Preprint arXiv:2303.01037, 2023.
[20] A. Zeyer, R. Schlüter, and H. Ney, “Towards online-recognition with deep bidirectional LSTM acoustic models,” in Interspeech, San Francisco, CA, USA, Sept. 2016.
[21] L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for ASR using self-attention network and chunk-hopping,” in ICASSP, 2019.
[22] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Transformer ASR with contextual block processing,” in ASRU, 2019, pp. 427–433.
[23] B. Zhang, D. Wu, Z. Yao, X. Wang, F. Yu, C. Yang, L. Guo, Y. Hu, L. Xie, and X. Lei, “Unified streaming and non-streaming two-pass end-to-end model for speech recognition,” Preprint arXiv:2012.05481, 2020.
[24] Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer, “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition,” in ICASSP, 2021, pp. 6783–6787.
[25] X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time streaming transformer transducer for speech recognition on large-scale dataset,” in ICASSP, 2021, pp. 5904–5908.
[26] K. An, H. Zheng, Z. Ou, H. Xiang, K. Ding, and G. Wan, “CUSIDE: Chunking, simulating future context and decoding for streaming ASR,” in Interspeech, 2022, pp. 2103–2107.
[27] F. Weninger, M. Gaudesi, M. A. Haidar, N. Ferri, J. Andrés-Ferrer, and P. Zhan, “Conformer with dual-mode chunked attention for joint online and offline ASR,” in Interspeech, 2022.
[28] P. Swietojanski, S. Braun, D. Can, T. F. Da Silva, A. Ghoshal, T. Hori, R. Hsiao, H. Mason, E. McDermott, H. Silovsky, R. Travadi, and X. Zhuang, “Variable attention masking for configurable transformer transducer speech recognition,” in ICASSP, 2023, pp. 1–5.
[29] H. Gulzar, M. R. Busto, T. Eda, K. Itoyama, and K. Nakadai, “miniStreamer: Enhancing small conformer with chunked-context masking for streaming ASR applications on the edge,” in Interspeech, 2023, pp. 3277–3281.
[30] C. Wang, Y. Wu, L. Lu, S. Liu, J. Li, G. Ye, and M. Zhou, “Low latency end-to-end streaming speech recognition with a scout network,” in Interspeech, 2020, pp. 2112–2116.
[31] N. Moritz, T. Hori, and J. L. Roux, “Dual causal/non-causal self-attention for streaming end-to-end speech recognition,” in Interspeech, 2021, pp. 1822–1826.
[32] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[33] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016, pp. 4960–4964.
[34] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019, pp. 2613–2617.
[35] Z. Tüske, G. Saon, K. Audhkhasi, and B. Kingsbury, “Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard,” in Interspeech, Oct. 2020.
[36] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, Oct. 2020, pp. 5036–5040.
[37] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[38] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in ICLR, 2017.
[39] W. Zhou, A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “Equivalence of segmental and neural transducer modeling: A proof of concept,” in Interspeech, Aug. 2021, pp. 2891–2895.
[40] G. Saon, Z. Tüske, and K. Audhkhasi, “Alignment-length synchronous decoding for RNN transducer,” in ICASSP, 2020.
[41] M. Zeineldeen, A. Glushko, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “Investigating methods to improve language model integration for attention-based encoder-decoder ASR models,” in Interspeech, Aug. 2021, pp. 2856–2860.
[42] A. Zeyer, A. Merboldt, W. Michel, R. Schlüter, and H. Ney, “Librispeech transducer model with internal language model prior correction,” in Interspeech, Aug. 2021, pp. 2052–2056.
[43] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
[44] A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks,” in LREC, 2014, pp. 3935–3939.
[45] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in ACL, 2016.
[46] A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in ACL, Melbourne, Australia, 2018.
[47] TensorFlow development team, “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” Preprint arXiv:1603.04467, 2016.
[48] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017.
[49] W. Zhou, R. Schlüter, and H. Ney, “Robust beam search for encoder-decoder attention based speech recognition without length bias,” in Interspeech, 2020, pp. 1768–1772.
[50] K. Murray and D. Chiang, “Correcting length bias in neural machine translation,” in WMT, 2018.