Peikun Chen (1,2), Sining Sun (2), Changhao Shan (2), Qing Yang (2), Lei Xie (1,*)

Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study

Abstract

Unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially in Automatic Speech Recognition (ASR). These models typically model discrete speech and text tokens in a unified way and then train a decoder-only Transformer. However, they are all designed for non-streaming ASR, where the entire speech utterance is needed during decoding. We therefore introduce a decoder-only model designed specifically for streaming recognition, which incorporates a dedicated boundary token to trigger text emission and employs causal attention masking during training. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model’s contextual modeling ability. Experiments on the AISHELL-1 and AISHELL-2 datasets show that our streaming approach performs competitively with its non-streaming decoder-only counterparts. The code used for this work is available at https://github.com/chenpk00/IS2024_stream_decoder_only_asr.

keywords:
streaming automatic speech recognition, discrete-token, decoder-only Transformer

1 Introduction

Recently, large language models (LLMs) [1, 2, 3, 4] have made great progress in various natural language processing (NLP) tasks. Methods based on LLMs are also leading a revolution in other research fields, such as computer vision [5, 6, 7] and speech [8, 9, 10, 11]. Thanks to the strong language understanding and generalization ability of LLMs, transferring an LLM pretrained on massive amounts of text data to speech recognition can also bring significant word error rate reductions [12, 13]. As for the model architecture, most LLMs are based on decoder-only Transformers [14], which simplifies the model structure used for ASR. LLM-based ASR is therefore becoming an active research topic in the field.

Currently, there are two primary approaches to integrating the speech modality with LLMs in LLM-based ASR. One approach directly combines continuous speech features with text embeddings through a trainable adaptor [15, 16, 17]. This kind of approach introduces additional acoustic encoders and models speech and text separately. The other approach treats speech representations as text-like discrete tokens and employs a decoder-only model to optimize multi-modal tasks jointly. For instance, VioLA [8] converts continuous speech signals to discrete codec codes via EnCodec [18] and unifies several speech-related tasks into a conditional language modeling task. SpeechGPT [9] employs LLaMA [2] as its foundational framework and tokenizes the speech signal with k-means clustering on HuBERT [19] representations. Similarly, AudioPaLM [10] uses PaLM-2 [20] as its underlying architecture and extracts discrete tokens from the encoder of the Universal Speech Model [21]. Such a unified modeling approach has been shown to improve ASR performance. The decoder-only Transformer with unified discrete input provides a new paradigm for various speech-related tasks, including but not limited to speech recognition and speech synthesis.

Previous work on decoder-only ASR has mainly focused on non-streaming scenarios [8, 10]. However, real-time streaming recognition delivers results faster and gives a better user experience in real-world applications. Over the past decades, many methods have been proposed to build faster and better streaming ASR systems on top of various end-to-end speech recognition frameworks [22, 23], but the exploration of streaming decoder-only speech recognition remains very limited. As decoder-only ASR models improve and their parameter counts grow, low-latency streaming inference becomes a challenging task.

This paper presents a pilot study on streaming decoder-only Transformer ASR. Current non-streaming decoder-only Transformer ASR models learn to predict text tokens autoregressively using the whole speech utterance [24]. A streaming model, in contrast, must emit text tokens with minimal delay as the corresponding speech segment is received. To this end, we investigate two approaches based on the speech-to-text alignment obtained from a GMM-HMM model [25]. The first approach, Text Token Insertion (TTI), inserts each text token directly into the speech token sequence during training, at the position given by the speech-to-text alignment. By contrast, Boundary Token Insertion (BTI) inserts special “boundary tokens” into the speech token sequence at the same positions and appends the text tokens at the end, effectively decoupling the speech and text modalities. Once a boundary token is triggered, the corresponding text token is generated autoregressively with a single inference step. In addition, we introduce right-chunk attention and various data augmentation techniques to improve the streaming model’s contextual modeling ability. We also explore the efficacy of leveraging an off-the-shelf text LLM to initialize our streaming ASR model.

Figure 1: Comparison of different methods for training a discrete-token-based decoder-only Transformer for ASR. (a) Non-streaming model: decoding starts after the whole speech token sequence is received; (b) Text Token Insertion (TTI) streaming model: text tokens are inserted directly into the speech token sequence, guided by the speech-to-text alignment; (c) Boundary Token Insertion (BTI) streaming model: “boundary tokens” are inserted into the discrete speech token sequence.

Experiments show that our proposed streaming decoder-only model can obtain 5.9% and 7.2% character error rate (CER) on two Chinese Mandarin corpora, AISHELL-1 [26] and AISHELL-2 [27], respectively, showing competitive performance with the non-streaming counterpart. Our results also show that the streaming decoder-only Transformer ASR model can benefit from the initialization from an off-the-shelf text LLM, such as Qwen [4].

2 Proposed Method

2.1 Streaming decoder-only model architecture

Figure 1 illustrates three types of decoder-only models designed for ASR with discrete speech token input. Figure 1 (a) represents the non-streaming model proposed in [24], whereas Figures 1 (b) and (c) depict two streaming variants. Both models (b) and (c) require a forced alignment between speech and text. As shown in Equation (1), given a discrete speech token sequence $x = (x_1, \dots, x_t, \dots, x_T)$ and the corresponding text token sequence $y = (y_1, \dots, y_L)$, the streaming ASR model is optimized by maximizing the conditional probability, where $t_{y_i} + \Delta$ is the time at which text token $y_i$ is emitted, $\Delta$ is a constant specifying how many right-context tokens can be used, $x_{\le t_{y_i}+\Delta} = (x_1, \dots, x_{t_{y_i}+\Delta})$, and $\theta$ denotes the trainable model parameters.

$$p(y \mid x; \theta) = \prod_{i=1}^{L} p\left(y_i \mid x_{\le t_{y_i}+\Delta}, \theta\right). \tag{1}$$

In Figure 1 (b), the Text Token Insertion (TTI) approach interleaves discrete speech and text tokens, with each text token inserted at the end of its corresponding speech segment. Mathematically, this is equivalent to optimizing Equation (1) directly. However, mixing text and speech tokens complicates beam search during decoding. During inference, speech tokens can be treated as conditions. Because text and speech tokens are interleaved, triggering a text token during beam search requires caching all historical hidden states (e.g., the keys and values of self-attention) for each search path.

Figure 1 (c) illustrates the Boundary Token Insertion (BTI) approach, where a special token, instead of a text token, is inserted into the speech token sequence, effectively decoupling the text and speech token sequences. In contrast to Figure 1 (b), this process can be viewed as comprising two stages: the first stage determines the boundary position, while the second stage predicts the corresponding text token conditioned on the history of speech tokens. Equation (2) provides a formal definition. Here, a hidden variable $b$ is introduced, where $b = (b_1, \dots, b_T) \in \{0, 1\}^T$ represents one of the possible boundary paths, and $\beta$ denotes the set of all possible paths. In practice, however, optimizing by summing over all possible paths is computationally challenging. Therefore, we approximate this optimization problem by selecting the most probable path $b_p$.

$$
\begin{aligned}
p(y \mid x, \theta) &= \sum_{b \in \beta} p(y, b \mid x, \theta) \\
&= \sum_{b \in \beta} \prod_{i=1}^{L} p\left(y_i, b \mid x_{\le t_{y_i}+\Delta}, \theta\right) \\
&\approx \prod_{i=1}^{L} p\left(y_i \mid b_p, x_{\le t_{y_i}+\Delta}, \theta\right) p\left(b_p \mid x_{\le t_{y_i}+\Delta}, \theta\right),
\end{aligned}
\tag{2}
$$
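The two training layouts described above can be illustrated with a minimal Python sketch. It assumes the forced alignment is available as a list of `(character, end_index)` pairs, where `end_index` is the position of the last speech token of the segment; the `<b>` symbol, the function names, and the toy tokens are hypothetical and not taken from the released code.

```python
from typing import List, Sequence, Tuple

BOUNDARY = "<b>"  # hypothetical symbol for the boundary token


def build_tti(speech: Sequence[str], align: Sequence[Tuple[str, int]]) -> List[str]:
    """TTI: place each text token right after the last speech token of its segment."""
    out: List[str] = []
    pos = 0
    for char, end in align:  # `end` = index of the segment's last speech token
        out.extend(speech[pos:end + 1])
        out.append(char)
        pos = end + 1
    out.extend(speech[pos:])  # trailing speech tokens, if any
    return out


def build_bti(speech: Sequence[str], align: Sequence[Tuple[str, int]]) -> List[str]:
    """BTI: place a boundary token at each segment end, then append all text tokens."""
    out: List[str] = []
    pos = 0
    for _, end in align:
        out.extend(speech[pos:end + 1])
        out.append(BOUNDARY)
        pos = end + 1
    out.extend(speech[pos:])
    out.extend(char for char, _ in align)
    return out


speech = ["s1", "s2", "s3", "s4", "s5"]
align = [("你", 1), ("好", 3)]
tti_seq = build_tti(speech, align)
bti_seq = build_bti(speech, align)
print(tti_seq)  # ['s1', 's2', '你', 's3', 's4', '好', 's5']
print(bti_seq)  # ['s1', 's2', '<b>', 's3', 's4', '<b>', 's5', '你', '好']
```

With $T$ speech tokens and $L$ text tokens, the BTI sequence in this sketch has length $T + 2L$: $T + L$ positions for speech and boundary tokens, followed by $L$ text tokens.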

Figure 2: Example diagram of different attention mechanisms. The green blocks indicate the part of the LLM. The yellow triangle indicates the part of the attention area. (a) global attention; (b) causal attention; (c) right-chunk attention.

2.2 Right-chunk attention

Previous non-streaming decoder-only ASR models predict the text tokens autoregressively, conditioned on the entire discrete speech token sequence. As illustrated in Figure 2 (a), the speech tokens primarily attend to each other, while the text tokens attend to all discrete speech tokens and the preceding text tokens.

To achieve streaming speech recognition, we introduce a causal attention mechanism. As shown in Equation (3), a causal mask is applied during the self-attention calculation, where $M$ is the mask.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QKM}{\sqrt{d_k}}\right)V, \tag{3}$$
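How a Boolean mask enters the attention computation can be sketched with NumPy. The sketch makes two assumptions that the text does not spell out: scores are computed as the scaled dot product $QK^{\top}/\sqrt{d_k}$, and blocked query-key pairs are set to $-\infty$ before the softmax (a common convention). Here `mask[i, j] = True` means that query position `i` may attend to key position `j`; the function name and the random inputs are illustrative only.

```python
import numpy as np


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] = True lets query i attend to key j."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -np.inf)  # assumption: blocked pairs get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
n, d = 5, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
causal = np.tril(np.ones((n, n), dtype=bool))  # query i sees keys 0..i only
out, weights = masked_attention(q, k, v, causal)
print(np.round(weights, 2))  # upper triangle is exactly zero
```

Every row of the mask must contain at least one `True` entry, otherwise the softmax in this sketch is undefined for that row.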

We divide the attention mask $M$ into two parts, $M_{\text{speech}}$ and $M_{\text{text}}$, which are the attention masks used for speech and text tokens, respectively. Equations (4) and (5) define $M_{\text{speech}}$ and $M_{\text{text}}$.

$$M_{\text{speech}}(i, j) = \begin{cases} \text{True} & i \le j \\ \text{False} & \text{otherwise} \end{cases}, \tag{4}$$

where $0 \le i \le T + L$ because of the insertion of boundary tokens in the speech. Meanwhile, as shown in Figure 2 (b), $M_{\text{text}}$ can be represented by

$$M_{\text{text}}(i, j) = \begin{cases} \text{True} & j \le t_{y_i} \ \text{or} \ T + L < i \le j \le T + 2L \\ \text{False} & \text{otherwise} \end{cases}, \tag{5}$$

where $M_{\text{text}}$ consists of the two yellow triangles. At each time step, the current text token attends to both the preceding speech and the preceding text.

Unlike non-streaming ASR models, streaming ASR models face limitations in considering global information, leading to weakened contextual modeling capabilities. To address this issue, we integrate the right-chunk attention mechanism, illustrated in Figure 2 (c). Unlike causal attention, right-chunk attention in the text portion enables capturing more speech information.

$$M_{\text{text}}(i, j) = \begin{cases} \text{True} & j \le t_{y_i} + \Delta \ \text{or} \ T + L < i \le j \le T + 2L \\ \text{False} & \text{otherwise} \end{cases}, \tag{6}$$

where $\Delta$ is the number of speech tokens in the right context that can be used. A larger $\Delta$ also means a higher latency.
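The difference between the causal and the right-chunk text mask can be made concrete with a small NumPy sketch. It assumes a layout of speech and boundary tokens followed by the text tokens, and uses the convention that row `i` is the attending position and column `j` the attended position; this index convention, the function name `build_mask`, and the toy boundary positions are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np


def build_mask(num_speech, boundary_pos, delta=0):
    """Boolean mask for [speech + boundary tokens | text tokens]; True = may attend.

    num_speech   : length of the speech part, boundary tokens included
    boundary_pos : boundary_pos[i] is the index of the boundary token of text token i
    delta        : number of right-context speech positions (0 gives the causal mask)
    """
    num_text = len(boundary_pos)
    n = num_speech + num_text
    mask = np.zeros((n, n), dtype=bool)
    # Speech part: plain causal attention.
    mask[:num_speech, :num_speech] = np.tril(np.ones((num_speech, num_speech), dtype=bool))
    for i, b in enumerate(boundary_pos):
        row = num_speech + i
        last = min(b + delta, num_speech - 1)  # clip at the end of the speech part
        mask[row, :last + 1] = True            # speech up to the boundary (+ delta)
        mask[row, num_speech:row + 1] = True   # preceding text tokens and itself
    return mask


# Toy layout: s1 s2 <b> s3 s4 <b> s5 | y1 y2
causal = build_mask(7, [2, 5], delta=0)
right_chunk = build_mask(7, [2, 5], delta=2)
print(causal[7].astype(int))       # y1 sees positions 0..2 and itself
print(right_chunk[7].astype(int))  # y1 additionally sees positions 3 and 4
```

In this sketch the right-chunk mask is a superset of the causal mask, and `delta` directly controls how many extra positions must arrive before a text token can be predicted.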

2.3 Data Pre-processing

Another significant improvement lies in the pre-processing steps. Throughout the training and decoding stages, we observed a tendency for the model to more readily overfit discrete features compared to continuous features, thus significantly impacting the final results. To address this concern, we employed data pre-processing, which plays a crucial role in model training and can substantially enhance data diversity and quantity. To improve the model’s robustness, we implemented the following strategies:

Speed Perturbation: Adjusting the audio speed increases the variety of discrete speech tokens, allowing the text segment to match a wider range of token combinations. In our study, we applied speed perturbation by altering the audio speed to 0.9 and 1.1 times the original speed.

Trigger Shift: Leveraging alignment biases to enhance the robustness of trigger tokens, we randomly shift trigger tokens by 1-4 frames with a probability of 30% during the training phase.

Time Masking: Time masking is applied to input tokens other than trigger tokens, encompassing both speech and text tokens, by substituting each token with a special padding token with a probability of 0.3.

Random De-duplication: Employing a randomized de-duplication approach reduces computational complexity while amplifying data diversity. Concurrently, during the decoding phase, we implement global de-duplication to further alleviate computational overhead.

Label Smoothing: The clusters used to discretize speech overlap, so information is lost during the speech discretization process. To address this issue, we adopt label smoothing [28],

$$\mathcal{L}_{\text{LS}} = \sum_{t=1}^{T_s} D_{\text{KL}}\left(q'(x_t \mid x_{<t}) \,\|\, p(x_t \mid x_{<t}; \theta)\right), \tag{7}$$

where $q'(x_t \mid x_{<t})$ is a soft label obtained by label smoothing instead of a one-hot label, and $D_{\text{KL}}$ is the Kullback-Leibler divergence.
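The token-level strategies above can be sketched in plain Python. Several details are not specified in the text and are assumptions here: the shift direction (either way), whether the 30% probability applies per boundary (assumed), the per-repeat drop probability of random de-duplication, the smoothing weight `eps`, and the symbols `<b>` and `<pad>`. Boundaries are represented as frame indices before they are inserted into the token sequence.

```python
import random

BOUNDARY = "<b>"  # hypothetical boundary (trigger) symbol
PAD = "<pad>"     # hypothetical padding symbol


def trigger_shift(boundaries, num_frames, rng, prob=0.3, max_shift=4):
    """Shift each boundary frame index by 1..max_shift frames with probability `prob`."""
    shifted = []
    for b in boundaries:
        if rng.random() < prob:
            # Assumption: the shift may go in either direction.
            b += rng.choice([-1, 1]) * rng.randint(1, max_shift)
        shifted.append(min(max(b, 0), num_frames - 1))  # stay inside the utterance
    return sorted(shifted)  # keep boundaries in temporal order


def time_mask(tokens, rng, prob=0.3):
    """Replace each non-boundary token with PAD with probability `prob`."""
    return [t if t == BOUNDARY or rng.random() >= prob else PAD for t in tokens]


def dedup(tokens, rng=None, drop_prob=1.0):
    """Drop consecutive repeats; drop_prob=1.0 removes all of them (global variant)."""
    out = []
    for t in tokens:
        is_repeat = bool(out) and out[-1] == t
        if is_repeat and (drop_prob >= 1.0 or rng.random() < drop_prob):
            continue
        out.append(t)
    return out


def smooth_label(target, vocab_size, eps=0.1):
    """Soft target q': 1 - eps on the true token, eps spread over the other tokens."""
    q = [eps / (vocab_size - 1)] * vocab_size
    q[target] = 1.0 - eps
    return q


rng = random.Random(0)
speech = ["a", "a", "a", "b", "b", "c", "a", "a"]
print(dedup(speech))                      # ['a', 'b', 'c', 'a']
print(dedup(speech, rng, drop_prob=0.5))  # only some of the repeats are dropped
print(time_mask(["a", BOUNDARY, "b", "c"], rng))
print(trigger_shift([10, 30], num_frames=50, rng=rng))
```

Calling `dedup` with `drop_prob=1.0` corresponds to the global de-duplication used at decoding time, while a value below 1 yields a different token sequence for the same utterance in each epoch.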

3 Experiments

3.1 Experimental Setup

Our implementation is based on Wenet [29], an open-source toolkit for End-to-End (E2E) speech recognition. Our ASR models employ a decoder-only architecture based on the Transformer. We use two model configurations: a small model (70 million parameters) without Large Language Model (LLM) initialization, which is compared with previous non-streaming attention-based encoder-decoder (AED) models, and a large model (310 million parameters), which is used to verify the feasibility of initializing from an off-the-shelf LLM. In future work, we intend to validate the approach with considerably larger LLMs, such as models with 2 billion or more parameters.

Dataset: In this study, we conduct experiments on two commonly used Chinese Mandarin corpora, 178-hour AISHELL-1 [26] and 1000-hour AISHELL-2 [27]. We report the character error rates (CER) of various models.
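For reference, CER is the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The following is a minimal Python sketch of that standard definition; the scoring script and any text normalization used in the experiments are not described here, so the function names and the toy strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit cost for substitution, insertion and deletion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


# One substitution and one deletion against a 6-character reference.
print(f"{cer(['今天天气很好'], ['今天天汽很']):.2f}")  # 33.33
```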

Discrete Speech Tokens: We draw inspiration from a recent study [30] that employs Canonical Correlation Analysis (CCA) to evaluate the similarity between layer representations and word labels. For the Chinese corpora, we select Chinese HuBERT large (https://huggingface.co/TencentGameMate/chinese-hubert-large) and use its layer 21, as it shows the highest CCA similarity with word labels. The number of k-means clusters is set to 2,000, consistent with previous methods [24, 31].
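Once the k-means centroids are trained, discretization amounts to a nearest-centroid lookup per frame. A minimal NumPy sketch of that step follows; the random centroids and the 16-dimensional features are stand-ins (real features would come from layer 21 of the HuBERT model, and extracting them is not shown), and the function name `quantize` is hypothetical.

```python
import numpy as np


def quantize(features, centroids):
    """Map each frame (row of `features`) to the index of its nearest centroid."""
    # Squared Euclidean distance via ||f||^2 - 2 f.c + ||c||^2, shape (frames, clusters).
    d2 = ((features ** 2).sum(axis=1, keepdims=True)
          - 2.0 * features @ centroids.T
          + (centroids ** 2).sum(axis=1))
    return d2.argmin(axis=1)


rng = np.random.default_rng(0)
centroids = rng.standard_normal((2000, 16))  # 2,000 clusters, stand-in dimension
# Four frames lying very close to centroids 5, 5, 42 and 1999.
frames = centroids[[5, 5, 42, 1999]] + 0.01 * rng.standard_normal((4, 16))
tokens = quantize(frames, centroids)
print(tokens)  # [   5    5   42 1999]
```

Consecutive frames often map to the same cluster, as with the first two frames here, which is what makes de-duplication of the token sequence possible.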

Model Configuration: The small decoder-only Transformer without LLM initialization comprises 8 blocks, each with 8 self-attention heads, an attention dimension of 512, and a feed-forward network (FFN) with an intermediate hidden dimension of 1024. To explore the effectiveness of off-the-shelf LLM initialization, we also adopt Qwen [4] as the backbone of our decoder-only Transformer model for discrete-token-based ASR. We use the Qwen2-0.5B model (https://huggingface.co/Qwen/Qwen1.5-0.5B; a Transformer with 310M parameters), which consists of 24 layers with a hidden size of 1024, 16 attention heads, and a maximum sequence length of 1024. Note that we do not use the Qwen text tokenizer because it can cause token sparsity problems, meaning that there is not enough training data for some tokens. Instead, we directly tokenize text into Chinese characters. There are 7,000 modeling units in total: 5,000 commonly used Chinese characters and 2,000 speech tokens. In all experiments, we set the dropout rate to 0.1. During training, we dynamically set $\Delta$ equal to the length of the speech segment corresponding to the next text token. During ASR decoding, we set the beam size to 10 and do not use any language model.

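Table 1: CERs (%) on AISHELL-1 for different methods.

| ID | Feature | Model type | Streaming | CER dev | CER test |
|---|---|---|---|---|---|
| B1 | SSL (Chinese HuBERT large) | encoder-decoder | No | 3.8 | 4.0 |
| B2 | Fbank [32] | encoder-decoder | No | 4.2 | 4.5 |
| B3 | Discrete [31] | encoder-decoder | No | 4.6 | 4.9 |
| Small | | | | | |
| S1 | Discrete | decoder-only | No | 5.9 | 6.2 |
| S2 | Discrete (TTI) | decoder-only | Yes | 9.4 | 9.8 |
| S3 | Discrete (BTI) | decoder-only | Yes | 6.1 | 6.4 |
| Large (w/ Qwen-0.5B init) | | | | | |
| L1 | Discrete | decoder-only | No | 5.2 | 5.5 |
| L2 | Discrete (TTI) | decoder-only | Yes | 9.2 | 9.5 |
| L3 | Discrete (BTI) | decoder-only | Yes | 5.6 | 5.9 |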

3.2 Main Results

Table 1 presents the CERs on AISHELL-1. The first group lists results from previous studies, including a self-supervised learning (SSL) [19] baseline and E-Branchformer [32] based ASR models with continuous Fbank features or discrete speech tokens [31] as input. The second group shows the results of our small model trained from random initialization with the two proposed streaming methods. Compared with S3, S2 makes notably more substitution errors, because inserting text tokens into the speech sequence weakens TTI’s contextual modeling ability. S3 performs better because BTI decouples boundary prediction from text prediction.

The third group shows the results of the large model initialized with the Qwen-0.5B LLM. With LLM initialization, we first train a non-streaming decoder-only model, L1. We then fine-tune L1 with the proposed streaming methods TTI and BTI, resulting in two streaming versions, L2 and L3. The large BTI model with LLM initialization achieves a 5.9% CER on the AISHELL-1 test set.

Comparing L2 with S2, LLM initialization yields only a minor gain for the TTI approach, a relative CER reduction of 3.1% (9.8% → 9.5%). We attribute this to the interleaving of speech and text, which differs from the data the LLM was trained on. In contrast, BTI obtains a 9.2% (6.5% → 5.9%) relative CER reduction on the test set compared with model S3. BTI thus not only achieves a lower CER but also adapts better to LLM initialization. Hence, we use the BTI approach in the subsequent experiments.
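The relative reductions quoted above follow from (baseline − new) / baseline. A short Python check of that arithmetic, using only the CER values stated in the text (the helper name is illustrative):

```python
def relative_reduction(baseline, new):
    """Relative CER reduction in percent."""
    return 100.0 * (baseline - new) / baseline


print(f"TTI: {relative_reduction(9.8, 9.5):.1f}%")  # TTI: 3.1%
print(f"BTI: {relative_reduction(6.5, 5.9):.1f}%")  # BTI: 9.2%
```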

3.3 Ablation Study

Table 2 shows the impact of right-chunk attention and of each data augmentation technique, based on model S3. Right-chunk attention has the largest impact on the overall CER: without it, the CER increases from 6.5% to 7.8% on the AISHELL-1 test set. The absence of right-chunk attention results in more substitution errors because less contextual information is available.

Among the five data pre-processing methods, label smoothing is the most effective method, resulting in a relative 9.7% CER reduction. We observe that when modeling speech and text tokens in a unified manner, predicting the next speech token is considerably easier than predicting the next text token. Consequently, the model tends to overfit during speech token prediction. Label smoothing effectively mitigates this overfitting issue. Additionally, other methods also contribute to some reduction in CER, ranging from 4.4% to 7.2%.

Table 2: Ablation of each component’s impact on CERs (%).

| Method | CER dev | CER test |
|---|---|---|
| BTI (ours) | 6.1 | 6.4 |
| w/o right-chunk attention | 7.4 | 7.8 |
| w/o speed perturb | 6.5 | 6.9 |
| w/o trigger shift | 6.4 | 6.7 |
| w/o time mask | 6.4 | 6.8 |
| w/o random de-duplication | 6.3 | 6.7 |
| w/o label smoothing | 6.8 | 7.2 |

3.4 Results on AISHELL-2

In Table 3, we present the performance on the AISHELL-2 corpus without speed perturbation. The top rows list the conventional Fbank-based ASR system and the HuBERT-Large discrete-token-based encoder-decoder ASR model. Using discrete token input with a decoder-only model yields slightly worse performance than the encoder-decoder model (6.9% vs. 6.6%). Meanwhile, the streaming model is a relative 4.1% worse than the non-streaming model (6.9% → 7.2%). We find that training ASR models with discrete units on large-scale data can be quite efficient. We believe that as the amount of data and the number of model parameters increase, the decoder-only model can surpass the traditional encoder-decoder model.

Table 3: CERs (%) on AISHELL-2 for different methods.

| ID | Feature | Model type | Streaming | CER |
|---|---|---|---|---|
| B4 | Fbank (https://github.com/wenet-e2e/wenet) | encoder-decoder | | 6.2 |
| B5 | Discrete | encoder-decoder | | 6.6 |
| Large (w/ Qwen-0.5B init) | | | | |
| L4 | Discrete | decoder-only | No | 6.9 |
| L5 | Discrete (BTI) | decoder-only | Yes | 7.2 |

4 Conclusion and Future Work

In this work, we present a pilot study on streaming decoder-only ASR with discrete speech units. We explore two approaches to streaming decoder-only ASR: Text Token Insertion (TTI) and Boundary Token Insertion (BTI). Experimental results on AISHELL-1 and AISHELL-2 show that the BTI method performs significantly better than TTI and achieves a CER competitive with the non-streaming decoder-only model. Initializing from a pretrained LLM further improves the performance of the proposed streaming decoder-only model. As this is a pilot study, considerable work remains to be done. We will conduct experiments on more languages, larger datasets, and larger models. In this study we use HuBERT as our speech tokenizer; we will also compare more speech tokenizers in the future.

References

  • [1] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.
  • [2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023.
  • [3] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, “GLM: general language model pretraining with autoregressive blank infilling,” in the 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022. Association for Computational Linguistics, 2022, pp. 320–335.
  • [4] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, et al., “Qwen technical report,” CoRR, vol. abs/2309.16609, 2023.
  • [5] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “Videogpt: Video generation using VQ-VAE and transformers,” CoRR, vol. abs/2104.10157, 2021.
  • [6] J. Y. Koh, R. Salakhutdinov, and D. Fried, “Grounding language models to images for multimodal inputs and outputs,” in International Conference on Machine Learning, ICML 2023. PMLR, 2023, pp. 17283–17300.
  • [7] L. Yu, Y. Cheng, Z. Wang, V. Kumar, W. Macherey, Y. Huang, D. A. Ross, I. Essa, Y. Bisk, M. Yang, K. P. Murphy, A. G. Hauptmann, and L. Jiang, “SPAE: semantic pyramid autoencoder for multimodal generation with frozen llms,” in Neural Information Processing Systems, NeurIPS, 2023.
  • [8] T. Wang, L. Zhou, Z. Zhang, Y. Wu, S. Liu, Y. Gaur, Z. Chen, J. Li, and F. Wei, “Viola: Unified codec language models for speech recognition, synthesis, and translation,” CoRR, vol. abs/2305.16107, 2023.
  • [9] D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” in EMNLP 2023. ACL, 2023, pp. 15757–15773.
  • [10] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, H. Muckenhirn, D. Padfield, et al., “Audiopalm: A large language model that can speak and listen,” CoRR, vol. abs/2306.12925, 2023.
  • [11] J. Wang, Z. Du, Q. Chen, Y. Chu, Z. Gao, Z. Li, K. Hu, X. Zhou, J. Xu, Z. Ma, W. Wang, S. Zheng, C. Zhou, Z. Yan, and S. Zhang, “Lauragpt: Listen, attend, understand, and regenerate audio with GPT,” CoRR, vol. abs/2310.04673, 2023.
  • [12] Y. Hu, C. Chen, C. H. Yang, R. Li, C. Zhang, P. Chen, and E. S. Chng, “Large language models are efficient learners of noise-robust speech recognition,” CoRR, vol. abs/2401.10446, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.10446
  • [13] C. Chen, R. Li, Y. Hu, S. M. Siniscalchi, P. Chen, E. S. Chng, and C. H. Yang, “It’s never too late: Fusing acoustic information into large language models for automatic speech recognition,” CoRR, vol. abs/2402.05457, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.05457
  • [14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
  • [15] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models,” CoRR, vol. abs/2311.07919, 2023.
  • [16] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: towards generic hearing abilities for large language models,” CoRR, vol. abs/2310.13289, 2023.
  • [17] J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu, and Y. Wu, “On decoder-only architecture for speech-to-text and large language model integration,” in IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023.   IEEE, 2023, pp. 1–8.
  • [18] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” CoRR, vol. abs/2210.13438, 2022.
  • [19] W. Hsu, B. Bolte, Y. H. Tsai et al., “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3451–3460, 2021.
  • [20] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, E. Chu, J. H. Clark, L. E. Shafey, Y. Huang, K. Meier-Hellstern, G. Mishra, E. Moreira, M. Omernick, K. Robinson, S. Ruder, Y. Tay, K. Xiao, Y. Cheng, C. Cherry, L. Gonzalez et al., “PaLM 2 technical report,” CoRR, vol. abs/2305.10403, 2023.
  • [21] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang, Z. Meng, K. Hu, A. Rosenberg, R. Prabhavalkar, D. S. Park, P. Haghani, J. Riesa, G. Perng, H. Soltau, T. Strohman, F. Beaufays, Y. Wu et al., “Google USM: scaling automatic speech recognition beyond 100 languages,” CoRR, vol. abs/2303.01037, 2023.
  • [22] S. Arora, G. Saon, S. Watanabe, and B. Kingsbury, “Semi-autoregressive streaming ASR with label context,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024.   IEEE, 2024.
  • [23] Q. Li, B. Li, D. Hwang, T. N. Sainath, and P. M. Mengibar, “Modular domain adaptation for conformer-based streaming ASR,” CoRR, vol. abs/2305.13408, 2023.
  • [24] Q. Chen, W. Wang, Q. Zhang, S. Zheng, S. Zhang, C. Deng, Y. Ma, H. Yu, J. Liu, and C. Zhang, “Loss masking is not needed in decoder-only transformer for discrete-token based ASR,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024.   IEEE, 2024.
  • [25] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
  • [26] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline,” in O-COCOSDA 2017.   IEEE, 2017, pp. 1–5.
  • [27] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL-2: transforming Mandarin ASR research into industrial scale,” CoRR, vol. abs/1808.10583, 2018.
  • [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016.   IEEE Computer Society, 2016, pp. 2818–2826.
  • [29] B. Zhang, D. Wu, Z. Peng, X. Song, Z. Yao, H. Lv, L. Xie, C. Yang, F. Pan, and J. Niu, “WeNet 2.0: More productive end-to-end speech recognition toolkit,” in Interspeech 2022.   ISCA, 2022, pp. 1661–1665.
  • [30] A. Pasad, B. Shi, and K. Livescu, “Comparative layer-wise analysis of self-supervised speech models,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023.   IEEE, 2023, pp. 1–5.
  • [31] X. Chang, B. Yan, K. Choi, J. Jung, Y. Lu, S. Maiti, R. S. Sharma, J. Shi, J. Tian, S. Watanabe, Y. Fujita, T. Maekaku, P. Guo, Y. Cheng, P. Denisov, K. Saijo, and H. Wang, “Exploring speech recognition, translation, and understanding with discrete speech units: A comparative study,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024.   IEEE, 2024, pp. 11 481–11 485.
  • [32] K. Kim, F. Wu, Y. Peng, J. Pan, P. Sridhar, K. J. Han, and S. Watanabe, “E-branchformer: Branchformer with enhanced merging for speech recognition,” in IEEE Spoken Language Technology Workshop, SLT 2022.   IEEE, 2022, pp. 84–91.