Peikun Chen (1,2), Sining Sun (2), Changhao Shan (2), Qing Yang (2), Lei Xie (1,*)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Abstract

Unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially automatic speech recognition (ASR). These models typically model discrete speech and text tokens in a unified way and then train a decoder-only Transformer. However, they are all designed for non-streaming ASR, where the entire speech utterance is needed during decoding. We therefore introduce a decoder-only model designed specifically for streaming recognition, which incorporates a dedicated boundary token to facilitate streaming recognition and employs causal attention masking during training. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model’s contextual modeling ability. Experiments on the AISHELL-1 and AISHELL-2 datasets show that our streaming approach performs competitively with non-streaming decoder-only counterparts. The code used in this work is available at https://github.com/chenpk00/IS2024_stream_decoder_only_asr.

keywords:

streaming automatic speech recognition, discrete-token, decoder-only Transformer

1 Introduction

Recently, large language models (LLMs) [1, 2, 3, 4] have made great progress in various natural language processing (NLP) tasks. Methods based on LLMs are also leading a revolution in other research fields, such as computer vision [5, 6, 7] and speech [8, 9, 10, 11]. Thanks to the strong language understanding and generalization ability of LLMs, transferring an LLM pretrained on massive amounts of text data to speech recognition can also bring a significant word error rate reduction [12, 13]. As for the model architecture, most LLMs are based on decoder-only Transformers [14], which simplifies the model structure used for ASR. LLM-based ASR is therefore becoming an active research topic in the field.

Currently, there are two primary approaches to integrating the speech modality with LLMs in LLM-based ASR. One approach directly integrates continuous features with text embeddings through a trainable adaptor [15, 16, 17]. This kind of approach introduces additional acoustic encoders and models speech and text separately. The other approach treats speech representations as text-like tokens and employs a decoder-only model to optimize multi-modal tasks jointly. For instance, VioLA [8] converts continuous speech signals to discrete codec codes via EnCodec [18] and unifies several speech-related tasks into a conditional language modeling task. SpeechGPT [9] employs LLaMA [2] as its foundation model and tokenizes the speech signal with $k$-means clustering on HuBERT [19] features. Similarly, AudioPaLM [10] uses PaLM-2 [20] as its underlying architecture and extracts discrete tokens from the encoder of the Universal Speech Model [21]. Such a unified modeling approach has been shown to improve ASR performance. The decoder-only Transformer with unified discrete input provides a new paradigm for various speech-related tasks, including but not limited to speech recognition and speech synthesis.

Previous works on decoder-only ASR tasks mainly focused on non-streaming scenarios [8, 10]. However, real-time streaming recognition can give faster recognition results and a better user experience in real-world applications. Many works have been proposed to build a faster and better streaming ASR system in the past decades based on various end-to-end speech recognition frameworks [22, 23]. However, the exploration of streaming decoder-only speech recognition is very limited. As decoder-only-based ASR model performance improves and the number of model parameters increases, streaming low-latency inference becomes a challenging task.

This paper presents a pilot study on the streaming decoder-only Transformer ASR model. Current non-streaming decoder-only Transformer ASR models learn to predict text tokens autoregressively using the whole speech utterance [24]. A streaming version must instead emit text tokens with minimal delay as the corresponding speech segment is received. To this end, we investigate two approaches based on the speech-to-text alignment obtained by a GMM-HMM model [25]. The first approach, Text Token Insertion (TTI), inserts the corresponding text tokens directly into the speech token sequence during training, guided by the speech-to-text alignment. By contrast, in Boundary Token Insertion (BTI), special “boundary tokens” are inserted into the speech token sequence in the same way, and the text tokens are appended at the end, effectively decoupling the speech and text modalities. Once a boundary token is triggered, the corresponding text token can be generated autoregressively through a one-step inference process. Meanwhile, we introduce right-chunk attention and various data augmentation techniques to improve the streaming model’s contextual modeling ability. We also explore the efficacy of leveraging an off-the-shelf text LLM to initialize our streaming ASR model.

Figure 1 (a): Non-streaming model.

Experiments show that our proposed streaming decoder-only model can obtain 5.9% and 7.2% character error rate (CER) on two Chinese Mandarin corpora, AISHELL-1 [26] and AISHELL-2 [27], respectively, showing competitive performance with the non-streaming counterpart. Our results also show that the streaming decoder-only Transformer ASR model can benefit from the initialization from an off-the-shelf text LLM, such as Qwen [4].

2 Proposed Method

2.1 Streaming decoder-only model architecture

Figure 1 illustrates three types of decoder-only models designed for ASR based on discrete speech token input. Figure 1 (a) shows the non-streaming model proposed in [24], whereas Figures 1 (b) and (c) depict two streaming variants. Both models (b) and (c) require forced alignment between speech and text. As shown in equation (1), given a discrete speech token sequence $x=(x_{1},...,x_{t},...,x_{T})$ and the corresponding text token sequence $y=(y_{1},...,y_{L})$, the streaming ASR model is optimized by maximizing the conditional probability, where $t_{y_{i}}$ is the time of emitting text token $y_{i}$, $\Delta$ is a constant specifying how many right-context tokens can be used, $x_{\leq t_{y_{i}}+\Delta}=(x_{1},...,x_{t_{y_{i}}+\Delta})$, and $\theta$ denotes the trainable model parameters.

p(y\mid x;\theta)=\prod_{i=1}^{L}p\left(y_{i}\mid x_{\leq t_{y_{i}}+\Delta};\theta\right).

(1)

In Figure 1 (b), the Text Token Insertion (TTI) approach interleaves discrete speech and text tokens, with each text token inserted at the end of its corresponding speech segment. Mathematically, this is equivalent to optimizing equation (1) directly. However, mixing text and speech tokens complicates beam search during decoding. During inference, speech tokens can be treated as conditions. Because text and speech tokens are interleaved, emitting a text token during beam search requires caching all historical hidden states (e.g., the keys and values of self-attention) for each search path.
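The interleaving used by TTI can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `build_tti_sequence`, the tuple encoding of tokens, and the representation of the forced alignment as a list of segment end indices are all hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # ("speech", id) or ("text", id); hypothetical encoding


def build_tti_sequence(
    speech: List[int], text: List[int], end_idx: List[int]
) -> List[Token]:
    """Interleave text tokens into the speech token stream.

    end_idx[i] is the 0-based index of the last speech token aligned to
    text[i] (e.g. from a forced alignment) and must be non-decreasing.
    """
    if len(text) != len(end_idx):
        raise ValueError("need one alignment end index per text token")
    out: List[Token] = []
    k = 0  # index of the next text token to insert
    for t, s in enumerate(speech):
        out.append(("speech", s))
        # Emit every text token whose speech segment ends at position t.
        while k < len(text) and end_idx[k] == t:
            out.append(("text", text[k]))
            k += 1
    if k != len(text):
        raise ValueError("alignment is unordered or outside the speech range")
    return out


# Toy example: 5 speech tokens, 2 text tokens ending at frames 1 and 4.
seq = build_tti_sequence([11, 12, 13, 14, 15], [7, 8], [1, 4])
```

In the toy example the first text token follows the second speech token and the second text token closes the sequence, so text and speech states are mixed in a single stream.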

Figure 1 (c) illustrates the Boundary Token Insertion (BTI) approach, where a special boundary token, instead of a text token, is inserted into the speech token sequence, effectively decoupling the text and speech token sequences. In contrast to Figure 1 (b), this process can be viewed as comprising two stages: the first stage determines the boundary position, and the second stage predicts the corresponding text token conditioned on the history of speech tokens. Equation (2) provides a formal definition. Here, a hidden variable $b$ is introduced, where $b=(b_{1},...,b_{T})\in\{0,1\}^{T}$ represents one possible boundary path and $\beta$ denotes the set of all possible paths. In practice, summing over all possible paths is computationally expensive, so we approximate the objective by selecting the most probable path $b_{p}$.

\begin{split}p\left(y\mid x,\theta\right)&=\sum_{b\in\beta}p\left(y,b\mid x,\theta\right)\\ &=\sum_{b\in\beta}\prod_{i=1}^{L}p\left(y_{i},b\mid x_{\leq t_{y_{i}}+\Delta},\theta\right)\\ &\approx\prod_{i=1}^{L}p\left(y_{i}\mid b_{p},x_{\leq t_{y_{i}}+\Delta},\theta\right)p\left(b_{p}\mid x_{\leq t_{y_{i}}+\Delta},\theta\right),\end{split}

(2)
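To make the BTI layout concrete, the sketch below builds a training sequence in which boundary tokens mark segment ends inside the speech stream and all text tokens follow at the end. The helper name `build_bti_sequence`, the `BOUNDARY` placeholder, and the alignment format are assumptions made for illustration only.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # ("speech", id), ("text", id) or the boundary marker
BOUNDARY: Token = ("boundary", 0)  # hypothetical special token


def build_bti_sequence(
    speech: List[int], text: List[int], end_idx: List[int]
) -> List[Token]:
    """Speech tokens with boundary markers, followed by all text tokens.

    end_idx[i] is the 0-based index of the last speech token aligned to
    text[i]; it must be non-decreasing.
    """
    if len(text) != len(end_idx):
        raise ValueError("need one alignment end index per text token")
    out: List[Token] = []
    k = 0  # index of the next boundary to place
    for t, s in enumerate(speech):
        out.append(("speech", s))
        while k < len(end_idx) and end_idx[k] == t:
            out.append(BOUNDARY)  # marks "a text token is due here"
            k += 1
    if k != len(end_idx):
        raise ValueError("alignment is unordered or outside the speech range")
    # Text tokens are appended after the speech part, not interleaved.
    out.extend(("text", y) for y in text)
    return out


seq = build_bti_sequence([11, 12, 13, 14, 15], [7, 8], [1, 4])
```

With $T$ speech tokens and $L$ text tokens the resulting sequence has length $T+2L$: $T+L$ positions for speech and boundary tokens, then $L$ positions for text.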

2.2 Right-chunk attention

Previous non-streaming decoder-only ASR models predict the text tokens autoregressively, conditioned on the entire discrete speech token sequence. As illustrated in Figure 2 (a), the speech tokens primarily attend to each other, while the text tokens attend to all discrete speech tokens and the preceding text tokens.

To achieve streaming speech recognition, we introduce a causal attention mechanism. As shown in equation (3), a causal mask $M$ is applied during the self-attention calculation, where $\odot M$ denotes that attention scores at positions where $M$ is False are masked out before the softmax.

\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{\top}\odot M}{\sqrt{d_{k}}}\right)V,

(3)
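As a small worked illustration of masked attention, the pure-Python sketch below sets the score of every disallowed position to $-\infty$ before the softmax, which is the usual way a boolean mask is applied in practice. It is not taken from the authors' code; the function name `masked_attention` and the convention that `M[i][j]` is True when query position `i` may attend to key position `j` are assumptions of this example.

```python
import math
from typing import List

Matrix = List[List[float]]


def masked_attention(Q: Matrix, K: Matrix, V: Matrix, M: List[List[bool]]) -> Matrix:
    """Scaled dot-product attention with a boolean mask (True = may attend)."""
    d_k = len(K[0])
    out: Matrix = []
    for i, q in enumerate(Q):
        # Scaled scores; masked positions get -inf so their weight becomes 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) if M[i][j] else -math.inf
            for j, k in enumerate(K)
        ]
        top = max(scores)
        if top == -math.inf:
            raise ValueError(f"query {i} has no visible key")
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append(
            [sum(w / total * V[j][c] for j, w in enumerate(weights))
             for c in range(len(V[0]))]
        )
    return out


# Causal mask for 3 positions: query i sees keys j <= i.
causal = [[j <= i for j in range(3)] for i in range(3)]
Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = masked_attention(Q, K, V, causal)
```

The first output row equals `V[0]` because the first query can only see itself, while the last row mixes all three value vectors.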

We divide the attention mask $M$ into two parts, denoted as $M_{speech}$ and $M_{text}$, representing the attention masks used for speech and text tokens, respectively. Equations (4) and (5) provide the definitions of $M_{speech}$ and $M_{text}$.

M_{speech}(i,j)=\left\{\begin{array}[]{ll}\text{True }&i\leq j\\ \text{False }&\text{ otherwise }\end{array}\right.,

(4)

where $0\leq i\leq T+L$ because of the insertion of boundary tokens into the speech sequence. Meanwhile, as shown in Figure 2 (b), $M_{text}$ can be represented by

M_{text}(i,j)=\left\{\begin{array}{ll}\text{True}&j\leq t_{y_{i}}\ \text{or}\ T+L<i\leq j\leq T+2L\\ \text{False}&\text{otherwise}\end{array}\right.,

(5)

where $M_{text}$ corresponds to the two yellow triangles in the figure. At each time step, the current text token attends to both the preceding speech tokens and the preceding text tokens.

Unlike non-streaming ASR models, streaming ASR models cannot use global information, which weakens their contextual modeling capability. To address this issue, we integrate a right-chunk attention mechanism, illustrated in Figure 2 (c). Unlike causal attention, right-chunk attention lets the text portion see additional speech tokens to the right of the boundary, as defined in equation (6).

M_{text}(i,j)=\left\{\begin{array}{ll}\text{True}&j\leq t_{y_{i}}+\Delta\ \text{or}\ T+L<i\leq j\leq T+2L\\ \text{False}&\text{otherwise}\end{array}\right.,

(6)

where $\Delta$ is the number of speech tokens in the right context that can be used. A larger $\Delta$ also means higher latency.
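The causal and right-chunk masks can be illustrated with a small constructor for the BTI layout. This is a sketch under several assumptions that are not stated in the text: `mask[i][j]` is True when query position `i` may attend to key position `j`, a query sees keys at or before itself, the speech part (speech plus boundary tokens) occupies the first `n_speech` positions, `boundary_pos[k]` is the position of the boundary for the `k`-th text token, and $\Delta$ is counted in positions of the speech part. The name `build_mask` is hypothetical.

```python
from typing import List


def build_mask(n_speech: int, boundary_pos: List[int], delta: int = 0) -> List[List[bool]]:
    """Boolean attention mask for a [speech+boundary | text] sequence.

    delta = 0 gives plain causal attention for the text rows; delta > 0
    lets each text token see `delta` extra positions of right context.
    """
    n_text = len(boundary_pos)
    n = n_speech + n_text
    mask = [[False] * n for _ in range(n)]
    # Speech part: ordinary causal attention.
    for i in range(n_speech):
        for j in range(i + 1):
            mask[i][j] = True
    for k in range(n_text):
        i = n_speech + k
        # Speech visible to text token k: up to its boundary plus delta.
        limit = min(boundary_pos[k] + delta, n_speech - 1)
        for j in range(limit + 1):
            mask[i][j] = True
        # Text visible to text token k: itself and earlier text tokens.
        for j in range(n_speech, i + 1):
            mask[i][j] = True
    return mask


# T = 5 speech tokens, L = 2 boundaries at positions 2 and 6 (n_speech = 7).
causal = build_mask(7, [2, 6], delta=0)
chunk = build_mask(7, [2, 6], delta=2)
```

Comparing the first text row of `causal` and `chunk` shows the effect of the right context: with `delta=2` the first text token sees two more positions of the speech part.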

2.3 Data Pre-processing

Another significant improvement lies in the pre-processing steps. During training and decoding, we observed that the model overfits more readily to discrete features than to continuous features, which significantly degrades the final results. To address this, we apply data pre-processing, which plays a crucial role in model training and can substantially increase data diversity and quantity. To improve the model’s robustness, we implemented the following strategies:

Speed Perturbation: Adjusting the audio speed increases the variety of discrete speech tokens, allowing the text segment to match a wider range of token combinations. In our study, we applied speed perturbation by altering the audio speed to 0.9 and 1.1 times the original speed.

Trigger Shift: To make the model robust to alignment bias in the trigger-token positions, we randomly shift trigger tokens by 1 to 4 frames with a probability of 30% during training.

Time Masking: Time masking is applied to input tokens other than trigger tokens, encompassing both speech and text tokens, by substituting each token with a special padding token with a probability of 0.3.

Random De-duplication: Employing a randomized de-duplication approach reduces computational complexity while amplifying data diversity. Concurrently, during the decoding phase, we implement global de-duplication to further alleviate computational overhead.

Label Smoothing: Discretized speech tokens overlap across clusters, so information is lost during speech discretization. To address this issue, we adopt label smoothing [28],

\mathcal{L}_{\text{LS}}=\sum_{t=1}^{T_{s}}D_{\text{KL}}\left(q^{\prime}(x_{t}\mid x_{<t})\,\|\,p(x_{t}\mid x_{<t};\theta)\right),

(7)

where $q^{\prime}(x_{t}|x_{<t})$ is the soft label produced by label smoothing instead of a one-hot label, and $D_{\text{KL}}$ is the Kullback-Leibler divergence.
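The token-level strategies above (trigger shift, time masking, random de-duplication, and label smoothing) can be sketched as follows. This is an illustrative sketch, not the authors' implementation. The token ids `PAD` and `BOUNDARY`, the shift direction, the de-duplication probability `p`, the smoothing weight `eps`, and the choice to spread `eps` uniformly over the non-target classes are all assumptions, since the text does not specify them.

```python
import math
import random
from typing import List

PAD = -1       # hypothetical id of the special padding token
BOUNDARY = -2  # hypothetical id of the boundary (trigger) token


def trigger_shift(end_idx: List[int], n_speech: int, rng: random.Random,
                  p: float = 0.3, max_shift: int = 4) -> List[int]:
    """With probability p, move a trigger position by 1..max_shift frames.

    Shifting in either direction is an assumption of this sketch.
    """
    out: List[int] = []
    for e in end_idx:
        if rng.random() < p:
            e += rng.choice((-1, 1)) * rng.randint(1, max_shift)
        e = min(max(e, 0), n_speech - 1)           # stay inside the utterance
        out.append(max(e, out[-1]) if out else e)  # keep triggers ordered
    return out


def time_mask(tokens: List[int], rng: random.Random, p: float = 0.3) -> List[int]:
    """Replace each non-trigger token by PAD with probability p."""
    return [t if t == BOUNDARY or rng.random() >= p else PAD for t in tokens]


def random_dedup(tokens: List[int], rng: random.Random, p: float = 0.5) -> List[int]:
    """Drop a token that repeats its predecessor with probability p.

    p = 1.0 removes every consecutive repeat (full de-duplication).
    """
    out: List[int] = []
    for t in tokens:
        if out and out[-1] == t and rng.random() < p:
            continue
        out.append(t)
    return out


def smoothed_target(target: int, vocab_size: int, eps: float = 0.1) -> List[float]:
    """Soft label: 1 - eps on the target, eps spread over the other classes."""
    q = [eps / (vocab_size - 1)] * vocab_size
    q[target] = 1.0 - eps
    return q


def kl_divergence(q: List[float], p: List[float]) -> float:
    """D_KL(q || p) for two discrete distributions over the same vocabulary."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)
```

Summing `kl_divergence(smoothed_target(x_t, V), p_t)` over time steps gives a loss of the form of equation (7), where `p_t` is the model's predicted distribution at step `t`.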

3 Experiments

3.1 Experimental Setup

Our implementation is based on WeNet [29], an open-source toolkit for end-to-end (E2E) speech recognition. Our ASR models employ a decoder-only architecture based on the Transformer. We use two model configurations: a small model (70 million parameters) without large language model (LLM) initialization, which is compared with previous non-streaming attention-based encoder-decoder (AED) models, and a large model (310 million parameters) used to verify the feasibility of initializing from an off-the-shelf LLM. In future work, we intend to validate the approach with significantly larger LLMs, such as models with 2 billion or more parameters.

Dataset: In this study, we conduct experiments on two commonly used Chinese Mandarin corpora, 178-hour AISHELL-1 [26] and 1000-hour AISHELL-2 [27]. We report the character error rates (CER) of various models.
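For reference, CER is conventionally computed as the character-level edit distance between reference and hypothesis, divided by the number of reference characters. The sketch below shows this standard definition; the function names are illustrative, and the scoring script actually used in the experiments is not described in the text.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


# One substitution and one deletion against a 6-character reference.
example = cer(["今天天气很好"], ["今天天汽好"])
```

Here the hypothesis differs from the reference by one substituted and one deleted character, so the example CER is 2 errors over 6 reference characters.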

Discrete Speech Tokens: We draw inspiration from a recent study [30] that employs Canonical Correlation Analysis (CCA) to evaluate the similarity between layer representations and word labels. For the Chinese corpora, we select Chinese HuBERT large (https://huggingface.co/TencentGameMate/chinese-hubert-large). We then take features from layer 21 of the large model, as it shows the highest CCA similarity with word labels. The number of k-means clusters is set to 2,000, consistent with previous work [24, 31].
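The discretization step amounts to mapping each feature frame to the index of its nearest k-means centroid. The toy sketch below illustrates this with 2-dimensional frames and 3 centroids; in the setup described above the frames would be layer-21 HuBERT features and there would be 2,000 centroids. The function name `assign_tokens` is hypothetical, and fitting the centroids is not shown.

```python
from typing import List, Sequence


def assign_tokens(frames: List[Sequence[float]],
                  centroids: List[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest centroid (squared L2)."""

    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda c: sqdist(frame, centroids[c]))
        for frame in frames
    ]


# Toy data: 4 frames, 3 centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, 0.0], [0.9, 1.2], [4.0, 6.0], [0.2, -0.1]]
tokens = assign_tokens(frames, centroids)
```

The resulting integer sequence is what the decoder-only model consumes as discrete speech tokens.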

Model Configuration: The small decoder-only Transformer without LLM initialization comprises 8 blocks, each with 8 self-attention heads, an attention dimension of 512, and a feed-forward network (FFN) with an intermediate hidden dimension of 1024. To explore the effectiveness of off-the-shelf LLM initialization, we also adopt Qwen [4] as the backbone of our decoder-only Transformer model for discrete-token-based ASR. We use the Qwen2-0.5B model (https://huggingface.co/Qwen/Qwen1.5-0.5B; Transformer with 310M parameters), which consists of 24 layers with a hidden size of 1024, 16 attention heads, and a maximum sequence length of 1024. Note that we do not use the Qwen text tokenizer because it can cause token sparsity, meaning that some tokens do not have enough training data. Instead, we directly tokenize text into Chinese characters. There are 7000 modeling units in total: 5000 commonly used Chinese characters and 2000 speech tokens. In all experiments, we set the dropout rate to 0.1. During training, we dynamically set $\Delta$ equal to the length of the speech segment corresponding to the next text token. During ASR decoding, we set the beam size to 10 and do not use a language model.
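One way to read the dynamic choice of $\Delta$ is sketched below: for each text token, the right context equals the length of the next token's speech segment, computed from the alignment end indices. The function name `dynamic_delta`, the alignment format, and the value 0 for the last token (which has no following segment) are assumptions of this example.

```python
from typing import List


def dynamic_delta(end_idx: List[int]) -> List[int]:
    """Right context per text token = length of the next token's segment.

    end_idx[i] is the index of the last speech token aligned to text token i.
    """
    if not end_idx:
        return []
    deltas = [nxt - cur for cur, nxt in zip(end_idx, end_idx[1:])]
    deltas.append(0)  # assumption: no right context after the last token
    return deltas


# Segments end at frames 3, 9 and 14, so the next segments span 6 and 5 frames.
deltas = dynamic_delta([3, 9, 14])
```

Under this reading each text token may look ahead exactly as far as the end of the following token's speech segment.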

Table 1: CERs (%) on AISHELL-1 for different methods.

ID	Feature	Model type	Streaming	CER (dev)	CER (test)
B1	SSL (Chinese HuBERT large)	encoder-decoder	✗	3.8	4.0
B2	Fbank [32]	encoder-decoder	✗	4.2	4.5
B3	Discrete [31]	encoder-decoder	✗	4.6	4.9
Small
S1	Discrete	decoder-only	✗	5.9	6.2
S2	Discrete (TTI)	decoder-only	✓	9.4	9.8
S3	Discrete (BTI)	decoder-only	✓	6.1	6.4
Large (w/ Qwen-0.5B init)
L1	Discrete	decoder-only	✗	5.2	5.5
L2	Discrete (TTI)	decoder-only	✓	9.2	9.5
L3	Discrete (BTI)	decoder-only	✓	5.6	5.9

3.2 Main Results

Table 1 presents the CERs on AISHELL-1. The first group lists results from previous studies, including a self-supervised learning (SSL) [19] based model and E-Branchformer [32] based ASR models with continuous Fbank features or discrete speech tokens [31] as input. The second group shows the results of our small model trained from random initialization with the two proposed streaming methods. Compared with S3, S2 makes notably more substitution errors because inserting text tokens into the speech stream, as TTI does, weakens the model’s contextual modeling ability. S3 achieves better performance by decoupling boundary prediction and text prediction through BTI. The third group shows the results of the large model initialized with the Qwen-0.5B LLM. With LLM initialization, we first train a non-streaming decoder-only model, L1. We then fine-tune L1 with our proposed streaming methods, TTI and BTI, resulting in two streaming versions, L2 and L3. The large model with LLM initialization achieves a 5.9% CER on the AISHELL-1 test set. Comparing L2 with S2, LLM initialization yields only a minor CER reduction for the TTI approach, a relative 3.1% (9.8% $\to$ 9.5%). We attribute this to the interleaved speech-text sequences differing from the text sequences on which the LLM was pretrained. Conversely, BTI obtains a 9.2% (6.5% $\to$ 5.9%) relative CER reduction on the test set compared to model S3. BTI thus not only achieves a lower CER but also adapts better to LLM initialization. Hence, we employ the BTI approach in subsequent experiments.
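The relative reductions quoted above follow the usual formula $(\text{CER}_{\text{old}}-\text{CER}_{\text{new}})/\text{CER}_{\text{old}}$. A small helper makes the arithmetic explicit, using the CER pairs exactly as quoted in the text; the function name is illustrative.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Relative CER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline - new) / baseline


tti_gain = relative_reduction(9.8, 9.5)  # TTI: S2 -> L2, as quoted in the text
bti_gain = relative_reduction(6.5, 5.9)  # BTI: figures as quoted in the text
```

Rounded to one decimal place, these evaluate to 3.1 and 9.2 percent, matching the relative figures given in the paragraph.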

3.3 Ablation Study

Table 2 shows the impact of right-chunk attention and various data augmentation based on model S3. Notably, the right-chunk attention has the most significant impact on the overall CER. Without right-chunk attention, the CER increases from 6.5% to 7.8% on the AISHELL-1 test set. The absence of right-chunk attention results in more substitution errors due to the limited contextual information available.

Among the five data pre-processing methods, label smoothing is the most effective method, resulting in a relative 9.7% CER reduction. We observe that when modeling speech and text tokens in a unified manner, predicting the next speech token is considerably easier than predicting the next text token. Consequently, the model tends to overfit during speech token prediction. Label smoothing effectively mitigates this overfitting issue. Additionally, other methods also contribute to some reduction in CER, ranging from 4.4% to 7.2%.

Table 2: Ablation of each component’s impact on CERs (%).

Method	CER (dev)	CER (test)
BTI (ours)	6.1	6.4
w/o right-chunk attention	7.4	7.8
w/o speed perturb	6.5	6.9
w/o trigger shift	6.4	6.7
w/o time mask	6.4	6.8
w/o random de-duplication	6.3	6.7
w/o label smoothing	6.8	7.2

3.4 Results on AISHELL-2

In Table 3, we present the performance on the AISHELL-2 corpus without speed perturbation. The top rows list the conventional Fbank-based ASR system and the HuBERT-Large discrete-token-based ASR model. Using discrete token input with a decoder-only model yields slightly worse performance than the encoder-decoder model (6.9% vs. 6.6%). Meanwhile, the streaming model degrades by a relative 4.1% (6.9% $\to$ 7.2%) compared to the non-streaming result. We find that training ASR models on discrete units with large-scale data can be quite efficient. We believe that as the amount of data and the number of model parameters increase, the decoder-only model can surpass the traditional encoder-decoder model.

Table 3: CERs (%) on AISHELL-2 for different methods.

ID	Feature	Model type	Streaming	CER
B4	Fbank (WeNet, https://github.com/wenet-e2e/wenet)	encoder-decoder	✗	6.2
B5	Discrete	encoder-decoder	✗	6.6
Large (w/ Qwen-0.5B init)
L4	Discrete	decoder-only	✗	6.9
L5	Discrete (BTI)	decoder-only	✓	7.2

4 Conclusion and Future Work

In this work, we present a pilot study on streaming decoder-only ASR with discrete speech units. We explore two approaches to achieving streaming decoder-only ASR: Text Token Insertion (TTI) and Boundary Token Insertion (BTI). Experimental results on AISHELL-1 and AISHELL-2 show that the BTI method yields significantly better performance than TTI and a CER competitive with the non-streaming decoder-only model. Initializing from a pretrained LLM further improves the performance of the proposed streaming decoder-only model. As this is a pilot study, considerable work remains for follow-up research. We will conduct experiments on more languages, larger datasets, and larger models. In this study we use HuBERT as our speech tokenizer; we will also compare more speech tokenizers in the future.

References

[1] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023.
[3] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “GLM: general language model pretraining with autoregressive blank infilling,” in the 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022. Association for Computational Linguistics, 2022, pp. 320–335.
[4] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, et al., “Qwen technical report,” CoRR, vol. abs/2309.16609, 2023.
[5] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video generation using VQ-VAE and transformers,” CoRR, vol. abs/2104.10157, 2021.
[6] J. Y. Koh, R. Salakhutdinov, and D. Fried, “Grounding language models to images for multimodal inputs and outputs,” in International Conference on Machine Learning, ICML 2023. PMLR, 2023, pp. 17283–17300.
[7] L. Yu, Y. Cheng, Z. Wang, V. Kumar, W. Macherey, Y. Huang, D. A. Ross, I. Essa, Y. Bisk, M. Yang, K. P. Murphy, A. G. Hauptmann, and L. Jiang, “SPAE: semantic pyramid autoencoder for multimodal generation with frozen llms,” in Neural Information Processing Systems, NeurIPS, 2023.
[8] T. Wang, L. Zhou, Z. Zhang, Y. Wu, S. Liu, Y. Gaur, Z. Chen, J. Li, and F. Wei, “Viola: Unified codec language models for speech recognition, synthesis, and translation,” CoRR, vol. abs/2305.16107, 2023.
[9] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” in EMNLP 2023. ACL, 2023, pp. 15757–15773.
[10] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, H. Muckenhirn, D. Padfield, et al., “Audiopalm: A large language model that can speak and listen,” CoRR, vol. abs/2306.12925, 2023.
[11] J. Wang, Z. Du, Q. Chen, Y. Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Ma, W. Wang, S. Zheng, C. Zhou, Z. Yan, and S. Zhang, “Lauragpt: Listen, attend, understand, and regenerate audio with GPT,” CoRR, vol. abs/2310.04673, 2023.
[12] Y. Hu, C. Chen, C. H. Yang, R. Li, C. Zhang, P. Chen, and E. S. Chng, “Large language models are efficient learners of noise-robust speech recognition,” CoRR, vol. abs/2401.10446, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.10446
[13] C. Chen, R. Li, Y. Hu, S. M. Siniscalchi, P. Chen, E. S. Chng, and C. H. Yang, “It’s never too late: Fusing acoustic information into large language models for automatic speech recognition,” CoRR, vol. abs/2402.05457, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.05457
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[15] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” CoRR, vol. abs/2311.07919, 2023.
[16] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: towards generic hearing abilities for large language models,” CoRR, vol. abs/2310.13289, 2023.
[17] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu, and Y. Wu, “On decoder-only architecture for speech-to-text and large language model integration,” in IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023. IEEE, 2023, pp. 1–8.
[18] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” CoRR, vol. abs/2210.13438, 2022.
[19] W. Hsu, B. Bolte, Y. H. Tsai, et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3451–3460, 2021.
[20] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Cheng, C. Cherry, L. Gonzalez, et al., “Palm 2 technical report,” CoRR, vol. abs/2305.10403, 2023.
[21] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, Z. Meng, K. Hu, A. Rosenberg, R. Prabhavalkar, D. S. Park, P. Haghani, J. Riesa, G. Perng, H. Soltau, T. Strohman, F. Beaufays, Y. Wu, et al., “Google USM: scaling automatic speech recognition beyond 100 languages,” CoRR, vol. abs/2303.01037, 2023.
[22] S. Arora, G. Saon, S. Watanabe, and B. Kingsbury, “Semi-autoregressive streaming ASR with label context,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 2024.
[23] Q. Li, B. Li, D. Hwang, T. N. Sainath, and P. M. Mengibar, “Modular domain adaptation for conformer-based streaming ASR,” CoRR, vol. abs/2305.13408, 2023.
[24] Q. Chen, W. Wang, Q. Zhang, S. Zheng, S. Zhang, C. Deng, Y. Ma, H. Yu, J. Liu, and C. Zhang, “Loss masking is not needed in decoder-only transformer for discrete-token based ASR,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 2024.
[25] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[26] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: an open-source mandarin speech corpus and a speech recognition baseline,” in O-COCOSDA 2017. IEEE, 2017, pp. 1–5.
[27] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL-2: transforming mandarin ASR research into industrial scale,” CoRR, vol. abs/1808.10583, 2018.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. IEEE Computer Society, 2016, pp. 2818–2826.
[29] B. Zhang, D. Wu, Z. Peng, X. Song, Z. Yao, H. Lv, L. Xie, C. Yang, F. Pan, and J. Niu, “Wenet 2.0: More productive end-to-end speech recognition toolkit,” in Interspeech 2022. ISCA, 2022, pp. 1661–1665.
[30] A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023. IEEE, 2023, pp. 1–5.
[31] X. Chang, B. Yan, K. Choi, J. Jung, Y. Lu, S. Maiti, R. S. Sharma, J. Shi, J. Tian, S. Watanabe, Y. Fujita, T. Maekaku, P. Guo, Y. Cheng, P. Denisov, K. Saijo, and H. Wang, “Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024. IEEE, 2024, pp. 11481–11485.
[32] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, “E-branchformer: Branchformer with enhanced merging for speech recognition,” in IEEE Spoken Language Technology Workshop, SLT 2022. IEEE, 2022, pp. 84–91.