Contextualized Automatic Speech Recognition
with Attention-Based Bias Phrase Boosted Beam Search
Abstract
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve the contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that the proposed method consistently improves the word error rate on Librispeech-960 (English) and the character error rate on our in-house Japanese dataset for the target phrases in the bias list.
Index Terms— speech recognition, attention, contextualization, biasing, beam search
1 Introduction
End-to-end (E2E) automatic speech recognition (ASR) [1, 2] methods directly convert acoustic feature sequences to token sequences without requiring the multiple components used in conventional ASR systems, such as acoustic models (AM) and language models (LM). Various E2E-ASR methods have been proposed previously, including connectionist temporal classification (CTC) [3], recurrent neural network transducer (RNN-T) [4], attention mechanism [5, 6], and their various hybrid systems [7, 8, 9]. Since the effectiveness of E2E-ASR methods is inherently related to the context in the training data, they may not perform consistently well for a given user context. For example, personal names and technical terms tend to be important keywords in different contexts, but such terms may not appear frequently in the available training data, which results in poor recognition accuracy. It is impractical to cover all contexts during training; thus, the user or developer should be able to contextualize the model easily without retraining.
A typical approach to this problem is shallow fusion using an external LM [10, 11, 12, 13, 14]. For example, [10, 11, 12] used a weighted finite state transducer (WFST) to construct an in-class LM to facilitate contextualization for the target named entities. Neural LM fusion methods have also been proposed [13, 14]. The LM fusion technique attempts to enhance accuracy by combining an E2E-ASR model with an external neural LM and then rescoring the hypotheses generated by the E2E-ASR model. However, whether a WFST or a neural LM is employed, building the external LM requires additional training steps.
Thus, several methods have been proposed that do not require retraining. These methods include knowledge graph modeling [15] for recognizing out-of-vocabulary named entities, contextual spelling correction [15] using an editable phrase list, and a named-entity-aware ASR model [16] that recognizes specific named entities based on phoneme similarity. However, these methods have limitations, such as requiring a speech synthesis (TTS) model for training and being unable to handle words other than predefined target named entities.
Deep biasing methods [17, 18, 19, 20] provide an alternative approach to realize effective contextualization without requiring retraining processes or TTS models. In such methods, the E2E-ASR model can be contextualized using an editable phrase list, which is referred to as a bias list in this paper. Most deep biasing methods implement a cross-attention layer between the bias list and input sequences to recognize the bias phrases correctly. However, it has been observed that simply adding a cross-attention layer for the bias list is not effective [21]. Thus, [21, 22] introduced an additional branch designed to detect bias phrases, which indirectly helps to update the parameters of the cross-attention layer through an auxiliary loss. In contrast, [23, 24] introduced an auxiliary loss function directly on the cross-attention layer (referred to as the bias phrase index loss and described in Section 3.2), which detects the bias phrase index. While this approach allows for a direct parameter update of the cross-attention layer, it cannot distinguish whether the output tokens come from the bias list or not. In addition, [23] requires two-stage training using a pretrained ASR model, which is time consuming.
This paper proposes a deep biasing method that employs both an auxiliary loss applied directly to the cross-attention layer, termed the bias phrase index loss, and special tokens for bias phrases to realize more effective bias phrase detection. Unlike conventional indirect methods [21, 22], our method facilitates the effective training of the cross-attention layer through the bias phrase index loss. Additionally, our technique departs from the existing method [23] by introducing special tokens for bias phrases. This allows the model to focus on the bias phrases more effectively, eliminating the need for a two-stage training process. Furthermore, we propose a bias phrase boosted (BPB) beam search algorithm that integrates the bias phrase index probability during inference, which improves bias phrase recognition. The main contributions of this study are as follows:
- We propose a deep biasing model that utilizes both bias phrase index loss and special tokens for the bias phrases.
- We propose a bias phrase boosted (BPB) beam search algorithm to further improve the performance for the target phrases.
- We demonstrate that the proposed method is effective for both the Librispeech-960 and our in-house Japanese dataset.
2 Attention-based encoder-decoder ASR
This section describes an attention-based encoder-decoder system that consists of an audio encoder and an attention-based decoder, both of which the proposed method extends.
2.1 Audio encoder
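The audio encoder comprises two convolutional layers, a linear projection layer, and Conformer blocks [25]. The audio encoder transforms an audio feature sequence $\mathbf{X}$ to $T$-length hidden state vectors $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_T] \in \mathbb{R}^{T \times d}$, where $d$ represents the dimension, as follows:
$$\mathbf{H} = \mathrm{AudioEnc}(\mathbf{X}) \tag{1}$$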
2.2 Attention-based decoder
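The posterior probability $P(\mathbf{y} \mid \mathbf{X})$ of a token sequence $\mathbf{y} = [y_1, \ldots, y_S]$ is formulated as follows:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{s=1}^{S} P(y_s \mid y_{0:s-1}, \mathbf{X}) \tag{2}$$
where $s$ and $S$ represent the token index and the total number of tokens, respectively. Given $\mathbf{H}$ generated by the audio encoder in Eq. (1) and the previous token sequence $y_{0:s-1}$, the attention-based decoder recursively estimates the next token $y_s$ as follows:
$$P(y_s \mid y_{0:s-1}, \mathbf{X}) = \mathrm{AttnDec}(y_{0:s-1}, \mathbf{H}) \tag{3}$$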
The attention-based decoder comprises an embedding layer with a positional encoding layer, Transformer blocks, and a linear layer. Each Transformer block has a multiheaded self-attention layer, a cross-attention layer (i.e., audio attention), and a linear layer with layer normalization (LN) layers and residual connections. Here, the audio attention layer including the LN is formulated as follows:
$$\mathbf{U}' = \mathrm{LN}\left(\mathbf{U} + \mathrm{Softmax}\left(\frac{\mathbf{U}\mathbf{H}^{\top}}{\sqrt{d}}\right)\mathbf{H}\right) \tag{4}$$
where $\mathbf{U} \in \mathbb{R}^{S \times d}$ and $\mathbf{U}' \in \mathbb{R}^{S \times d}$ represent the input and output of the audio attention layer, respectively, and $\mathbf{H}$ is the audio encoder output. In addition, the hybrid CTC/attention model [7] includes a CTC decoder. The attention-based decoder will be extended to the proposed bias decoder in Section 3.2.
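The audio attention step can be illustrated with a small sketch. The code below is a minimal single-head scaled dot-product attention in plain Python; it is not the authors' implementation, and it omits the learned projections, multiple heads, LN, and residual connection of a real Transformer block. The names `softmax`, `cross_attention`, `queries`, and `memory` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Single-head scaled dot-product attention with keys = values = memory.

    queries: S x d list of decoder-side vectors.
    memory:  T x d list of encoder hidden states.
    Returns (contexts, weights): S x d context vectors and S x T weights.
    """
    d = len(memory[0])
    contexts, all_weights = [], []
    for q in queries:
        # Scaled dot product between one query and every memory vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in memory]
        weights = softmax(scores)
        # Weighted sum of the memory vectors.
        context = [sum(w * v[j] for w, v in zip(weights, memory))
                   for j in range(d)]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 toy encoder states, d = 2
queries = [[2.0, 0.0], [0.0, 2.0]]             # S = 2 toy decoder-side vectors
contexts, weights = cross_attention(queries, memory)
```

Each row of `weights` is a distribution over the encoder frames. The same mechanism, applied to phrase-level features instead of audio frames, underlies the bias attention introduced later.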
![Figure 1: Overall architecture of the proposed method.](extracted/5356396/bias_decoder.png)
3 Proposed deep biasing method
Figure 1 shows the overall architecture of the proposed method, which comprises the audio encoder, bias encoder, and bias decoder. These components are described in the following subsections.
3.1 Bias encoder
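The bias encoder comprises an embedding layer with a positional encoding layer, Transformer blocks, and a mean pooling layer, and it takes as input a bias list $\mathbf{B} = \{\mathbf{b}_0, \mathbf{b}_1, \ldots, \mathbf{b}_N\}$, where $n$ and $\mathbf{b}_n$ represent the bias phrase index and the token sequence of the $n$-th bias phrase (e.g., “play a song”), respectively. Here, $\mathbf{b}_0$ is a dummy phrase which means “no-bias”. After applying zero padding based on the max token length $L_{\mathrm{max}}$ in the bias list $\mathbf{B}$, the embedding layer and the Transformer blocks extract a set of token-level feature sequences, $\mathbf{G} \in \mathbb{R}^{(N+1) \times L_{\mathrm{max}} \times d}$, as follows:
$$\mathbf{G} = \mathrm{TransformerEnc}(\mathrm{Embedding}(\mathbf{B})) \tag{5}$$
Then, mean pooling is performed to extract a phrase-level feature sequence, $\mathbf{V} = [\mathbf{v}_0, \mathbf{v}_1, \ldots, \mathbf{v}_N] \in \mathbb{R}^{(N+1) \times d}$, as follows:
$$\mathbf{V} = \mathrm{MeanPool}(\mathbf{G}) \tag{6}$$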
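The padding and pooling steps can be sketched as follows. This is a minimal illustration in plain Python under several assumptions: the token ids are made up, `toy_embedding` is a stand-in for the embedding layer and Transformer blocks, the dummy "no-bias" phrase is represented by a single hypothetical token id, and padded positions are excluded from the mean (the text does not say whether the pooling is masked).

```python
def pad_phrases(phrases, pad_id=0):
    """Zero-pad token-id sequences to the longest phrase; also return lengths."""
    max_len = max(len(p) for p in phrases)
    padded = [p + [pad_id] * (max_len - len(p)) for p in phrases]
    lengths = [len(p) for p in phrases]
    return padded, lengths


def mean_pool(token_features, lengths):
    """Average the token-level vectors of each phrase, skipping padded positions."""
    pooled = []
    for feats, n in zip(token_features, lengths):
        d = len(feats[0])
        pooled.append([sum(feats[t][j] for t in range(n)) / n for j in range(d)])
    return pooled


def toy_embedding(token_id, d=2):
    """Stand-in for the embedding layer + Transformer blocks (not a real encoder)."""
    return [float(token_id)] * d


# Hypothetical token ids; entry 0 of the bias list is the dummy "no-bias" phrase.
NO_BIAS = [1]
bias_list = [NO_BIAS, [5, 6], [7, 8, 9]]  # e.g. [7, 8, 9] could stand for "play a song"

padded, lengths = pad_phrases(bias_list)
token_features = [[toy_embedding(t) for t in phrase] for phrase in padded]
phrase_features = mean_pool(token_features, lengths)
print(phrase_features)  # [[1.0, 1.0], [5.5, 5.5], [8.0, 8.0]]
```

The result has one vector per bias phrase, including the no-bias entry, which is the shape the bias attention layer needs for its keys and values.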
3.2 Bias decoder
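The bias decoder is an extension of the attention-based decoder described in Section 2.2, where an additional cross-attention layer (i.e., bias attention) is introduced to each Transformer block, as shown in Figure 1. Unlike Eq. (2), the posterior probability is formulated using the bias list $\mathbf{B}$ as follows:
$$P(\mathbf{y} \mid \mathbf{X}, \mathbf{B}) = \prod_{s=1}^{S} P(y_s \mid y_{0:s-1}, \mathbf{X}, \mathbf{B}) \tag{7}$$
Given the audio encoder output $\mathbf{H}$ and the phrase-level bias features $\mathbf{V}$ from Eqs. (1) and (6), and the previous token sequence $y_{0:s-1}$, the bias decoder estimates the next token $y_s$ recursively, unlike Eq. (3), as follows:
$$P(y_s \mid y_{0:s-1}, \mathbf{X}, \mathbf{B}) = \mathrm{BiasDec}(y_{0:s-1}, \mathbf{H}, \mathbf{V}) \tag{8}$$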
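In the Transformer block of the bias decoder, the bias attention layer including the LN is formulated as follows:
$$\mathbf{U}'' = \mathrm{LN}\left(\mathbf{U}' + \mathrm{Softmax}\left(\frac{\mathbf{U}'\mathbf{V}^{\top}}{\sqrt{d}}\right)\mathbf{V}\right) \tag{9}$$
where $\mathbf{U}'$ is the output of the audio attention layer in Eq. (4), $\mathbf{V}$ is the phrase-level bias feature sequence in Eq. (6), and $\mathbf{U}''$ is the output of the bias attention layer. In addition, the bias attention layer estimates the bias phrase index sequence $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_S]$ as follows:
$$P(\boldsymbol{\pi} \mid \mathbf{X}, \mathbf{B}) = \prod_{s=1}^{S} P(\pi_s \mid y_{0:s-1}, \mathbf{X}, \mathbf{B}) \tag{10}$$
$$P(\pi_s \mid y_{0:s-1}, \mathbf{X}, \mathbf{B}) = \mathrm{Softmax}\left(\frac{\mathbf{u}'_s\mathbf{V}^{\top}}{\sqrt{d}}\right) \tag{11}$$
where $\mathbf{u}'_s$ denotes the $s$-th feature vector of $\mathbf{U}'$. For example, if a bias phrase, “play a song” with a bias index of 2 (Figure 1) is detected in a complete utterance, “I play a song today”, the bias phrase index sequence is $\boldsymbol{\pi}$ = [0, 2, 2, 2, 0]. Model parameters are optimized using the cross entropy (CE) losses as follows:
$$\mathcal{L}_{\mathrm{att}} = \mathrm{CE}\left(P(\mathbf{y} \mid \mathbf{X}, \mathbf{B}), \mathbf{y}^{\mathrm{ref}}\right) \tag{12}$$
$$\mathcal{L}_{\mathrm{bidx}} = \mathrm{CE}\left(P(\boldsymbol{\pi} \mid \mathbf{X}, \mathbf{B}), \boldsymbol{\pi}^{\mathrm{ref}}\right) \tag{13}$$
where $\mathbf{y}^{\mathrm{ref}}$ and $\boldsymbol{\pi}^{\mathrm{ref}}$ represent the one-hot vector sequences of the reference transcription and the reference bias phrase index including the no-bias option, respectively. Here, we refer to $\mathcal{L}_{\mathrm{bidx}}$ as the bias phrase index loss.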
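The reference bias phrase index sequence from the "I play a song today" example can be reproduced with a short sketch. The code below is only an illustration: it works on whole words rather than subword tokens, matches phrases left to right with the first match winning, and ignores how the special tokens are labeled. The function name `bias_index_sequence` and the extra phrase "hello world" are hypothetical.

```python
def bias_index_sequence(tokens, bias_list):
    """Label each token with the index of the bias phrase covering it (0 = no-bias).

    bias_list[0] is the dummy no-bias entry. Phrases are matched left to right,
    and the first matching phrase wins (an assumption of this sketch).
    """
    labels = [0] * len(tokens)
    pos = 0
    while pos < len(tokens):
        for idx in range(1, len(bias_list)):
            phrase = bias_list[idx]
            if phrase and tokens[pos:pos + len(phrase)] == phrase:
                for k in range(pos, pos + len(phrase)):
                    labels[k] = idx
                pos += len(phrase)
                break
        else:
            # No bias phrase starts here; leave the no-bias label and move on.
            pos += 1
    return labels


bias_list = [[], ["hello", "world"], ["play", "a", "song"]]
tokens = ["I", "play", "a", "song", "today"]
print(bias_index_sequence(tokens, bias_list))  # [0, 2, 2, 2, 0]
```

Converting these labels to one-hot vectors over the bias list entries gives the targets for the cross-entropy loss on the bias attention weights.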
3.3 Training
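During the training process, a bias list $\mathbf{B}$ is created randomly from the corresponding reference transcriptions for each batch. Specifically, 0 to $N_{\mathrm{utt}}$ bias phrases of 2 to $L_{\mathrm{max}}$ token lengths are extracted uniformly for each utterance, resulting in a total of $N$ bias phrases per batch. After the bias list $\mathbf{B}$ is extracted randomly, special tokens (sob/eob) are inserted before and after the extracted phrases in the reference transcription to distinguish whether the output tokens come from the bias list or not. The proposed method is optimized via multitask learning using the weighted sum of losses, as expressed in Eqs. (12), (13), and the CTC loss ($\mathcal{L}_{\mathrm{ctc}}$):
$$\mathcal{L} = \lambda_{\mathrm{ctc}}\mathcal{L}_{\mathrm{ctc}} + \lambda_{\mathrm{att}}\mathcal{L}_{\mathrm{att}} + \lambda_{\mathrm{bidx}}\mathcal{L}_{\mathrm{bidx}} \tag{14}$$
where $\lambda_{\mathrm{ctc}}$, $\lambda_{\mathrm{att}}$, and $\lambda_{\mathrm{bidx}}$ represent the training weights.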
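The training-time data preparation described in this section can be sketched as below. This is a hedged illustration, not the authors' pipeline: the sampling routine assumes non-overlapping spans and silently skips draws that do not fit, the marker strings `<sob>` and `<eob>` are placeholders for the special tokens, and all function names are hypothetical.

```python
import random


def sample_bias_spans(num_tokens, max_phrases, max_len, rng):
    """Sample 0..max_phrases non-overlapping spans of 2..max_len tokens."""
    spans = []
    for _ in range(rng.randint(0, max_phrases)):
        length = rng.randint(2, max_len)
        if length > num_tokens:
            continue  # phrase would not fit into this utterance
        start = rng.randint(0, num_tokens - length)
        end = start + length
        if all(end <= s or start >= e for s, e in spans):
            spans.append((start, end))
    return sorted(spans)


def insert_special_tokens(tokens, spans, sob="<sob>", eob="<eob>"):
    """Wrap each (start, end) span of the reference with sob/eob markers."""
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            out.append(eob)
        if i in starts:
            out.append(sob)
        out.append(tok)
    if len(tokens) in ends:
        out.append(eob)  # span that runs to the end of the utterance
    return out


def total_loss(l_ctc, l_att, l_bidx, w_ctc, w_att, w_bidx):
    """Weighted sum of the CTC, attention, and bias phrase index losses."""
    return w_ctc * l_ctc + w_att * l_att + w_bidx * l_bidx


tokens = ["I", "play", "a", "song", "today"]
print(insert_special_tokens(tokens, [(1, 4)]))
# ['I', '<sob>', 'play', 'a', 'song', '<eob>', 'today']

rng = random.Random(0)
spans = sample_bias_spans(num_tokens=12, max_phrases=2, max_len=10, rng=rng)
bias_phrases = [tokens_slice for tokens_slice in ([s, e] for s, e in spans)]
```

The sampled spans define both the bias list entries for the batch and the positions of the special tokens in the reference, so the two stay consistent by construction.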
![Algorithm 1: BPB beam search algorithm.](extracted/5356396/algorithm.png)
3.4 BPB beam search algorithm
We also propose a bias phrase boosted (BPB) beam search algorithm that exploits the bias phrase index probability, as described in Algorithm 1. The bias decoder calculates the token probability including the special tokens, sob/eob, using Eq. (8) (line 5). We then estimate the bias phrase index $\hat{\pi}_s$ using Eq. (11) and the argmax function (line 6). Here, the number of bias phrases $N$ in the bias list can increase significantly during inference, which would reduce the peak value after applying the softmax function in Eq. (9). Thus, the softmax is approximated using top-$k_{\mathrm{bias}}$ pruning as follows:
$$\mathrm{Softmax}\left(\frac{\mathbf{u}'_s\mathbf{V}^{\top}}{\sqrt{d}}\right) \approx \mathrm{Softmax}\left(\mathrm{TopK}\left(\frac{\mathbf{u}'_s\mathbf{V}^{\top}}{\sqrt{d}}, k_{\mathrm{bias}}\right)\right) \tag{15}$$
where $\mathbf{u}'_s\mathbf{V}^{\top}/\sqrt{d}$ are the bias attention scores for the $s$-th token. Then, if $\hat{\pi}_s$ = 0 (i.e., “no-bias”), the token probabilities for the special tokens [sob] and [eob] are penalized based on the weight $\lambda_{\mathrm{pen}}$ (lines 8 and 9); otherwise, the token probabilities of the tokens in the detected bias phrase are increased according to the weight $\lambda_{\mathrm{bonus}}$ (lines 11 to 13). For example, if the detected bias phrase is “play a song”, the token probabilities for “play”, “a”, and “song” are increased with $\lambda_{\mathrm{bonus}}$. Based on the boosted probabilities, top-$k_{\mathrm{beam}}$ pruning is performed as in the conventional beam search [7].
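One decoding step of this procedure can be sketched roughly as follows. Since Algorithm 1 is only available as an image here, the details are assumptions: the bonus and penalty are applied additively in the log domain, only tokens already present in the candidate set are rescored, and the weights and pruning sizes in the usage example are illustrative values. The names `topk_softmax`, `boost_log_probs`, and `prune` are hypothetical.

```python
import math


def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; pruned entries get probability 0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


def boost_log_probs(log_probs, index_probs, bias_list, bonus, penalty,
                    sob="<sob>", eob="<eob>"):
    """One BPB-style rescoring step on a token -> log-probability dict.

    Assumption: bonus and penalty are added/subtracted in the log domain.
    """
    boosted = dict(log_probs)
    best = max(range(len(index_probs)), key=lambda i: index_probs[i])
    if best == 0:
        # "no-bias" detected: discourage opening or closing a bias phrase.
        for tok in (sob, eob):
            if tok in boosted:
                boosted[tok] -= penalty
    else:
        # A bias phrase is detected: boost every token of that phrase.
        for tok in set(bias_list[best]):
            if tok in boosted:
                boosted[tok] += bonus
    return boosted


def prune(log_probs, beam_size):
    """Keep the beam_size highest-scoring tokens."""
    return sorted(log_probs, key=log_probs.get, reverse=True)[:beam_size]


bias_list = [[], ["hello", "world"], ["play", "a", "song"]]
index_probs = topk_softmax([0.1, 0.3, 2.0], k=2)       # toy bias attention scores
log_probs = {"play": -2.0, "pray": -1.5, "<sob>": -3.0, "<eob>": -4.0}
boosted = boost_log_probs(log_probs, index_probs, bias_list, bonus=1.0, penalty=10.0)
print(prune(boosted, beam_size=2))  # ['play', 'pray']
```

In this toy step, the acoustically confusable "pray" initially outscores "play", and the boost reverses the ranking because index 2 ("play a song") dominates the pruned softmax.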
4 Experiment
4.1 Experimental setup
The input features are 80-dimensional Mel filterbanks with a window size of 512 samples and a hop length of 160 samples. Then, SpecAugment [26] is applied. The audio encoder has two convolutional layers with a stride of two for downsampling, a 256-dimensional linear projection layer, and 12 Conformer blocks with 1024 linear units. The bias encoder has three Transformer blocks with 1024 linear units, and the bias decoder has six Transformer blocks with 2048 linear units. The attention layers in the audio encoder, the bias encoder, and the bias decoder are four-head attention layers with a dimension, $d$, of 256. During the training process, a bias list is created randomly for each batch with $N_{\mathrm{utt}}$ = 2 and $L_{\mathrm{max}}$ = 10, as described in Section 3.3. In this experiment, the bias list has a total of $N$ = 50 to 200 bias phrases within a batch. The training weights $\lambda_{\mathrm{ctc}}$, $\lambda_{\mathrm{att}}$, and $\lambda_{\mathrm{bidx}}$ (described in Eq. (14)) are set to 0.3, 0.7, and 1.0, respectively. The proposed model is trained for 150 epochs at a learning rate of 0.0015 with 15,000 warmup steps using the Adam optimizer. During the decoding process, the four hyperparameters of the BPB beam search (the two pruning sizes and the two probability weights in Section 3.4) are set to 20, 50, 1.0, and 10.0.
The Librispeech corpus (960 h, 100 h) [27] is used to evaluate the proposed method using ESPnet as the E2E-ASR toolkit [28]. The proposed method is evaluated in terms of word error rate (WER), bias phrase WER (B-WER), and unbiased phrase WER (U-WER) [29]. Note that insertion errors are counted toward B-WER if the inserted phrases are present in the bias list; otherwise, insertion errors are counted toward the U-WER. The goal of the proposed method is to improve the B-WER with a slight degradation in the U-WER and overall WER.
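The three metrics can be illustrated with a small scoring sketch. It is not the evaluation script used in the experiments: it assumes word-level membership in a set of bias words, attributes substitutions and deletions to the reference word, attributes insertions according to whether the inserted word is a bias word, and scores a single utterance. The names `align` and `biased_wer` are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def biased_wer(ref, hyp, bias_words):
    """Return (WER, U-WER, B-WER) in percent for one reference/hypothesis pair."""
    errors = {"b": 0, "u": 0}
    counts = {"b": 0, "u": 0}
    for word in ref:
        counts["b" if word in bias_words else "u"] += 1
    for op, r, h in align(ref, hyp):
        if op == "ok":
            continue
        word = h if op == "ins" else r  # insertions are judged by the inserted word
        errors["b" if word in bias_words else "u"] += 1

    def pct(err, total):
        return 100.0 * err / total if total else 0.0

    return (pct(errors["b"] + errors["u"], len(ref)),
            pct(errors["u"], counts["u"]),
            pct(errors["b"], counts["b"]))


ref = "i play a song today".split()
hyp = "i pray a song today now".split()
wer, u_wer, b_wer = biased_wer(ref, hyp, bias_words={"play", "a", "song"})
```

Here the substitution "play" to "pray" counts toward the B-WER, while the inserted "now" counts toward the U-WER because it is not in the bias list.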
4.2 Preliminary analysis of the proposed techniques
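Table 1: Effect of the proposed techniques on the Librispeech-100 test-clean set (bias list size $N$ = 100).
ID | Model | WER | U-WER | B-WER |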
---|---|---|---|---|
A1 | Baseline [7] | 8.59 | 5.87 | 30.71 |
B1 | Bias decoder | 8.11 | 5.43 | 29.89 |
B2 | B1 + bias phrase index loss | 7.53 | 5.27 | 25.92 |
B3 | B2 + sob/eob tokens | 6.93 | 4.96 | 23.00 |
B4 | B3 + BPB beam search | 5.92 | 5.00 | 17.93 |
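![Figure 2 (a): Bias phrase index probabilities without the bias phrase index loss.](extracted/5356396/wo_attn_loss.png)
![Figure 2 (b): Bias phrase index probabilities with the bias phrase index loss.](extracted/5356396/w_attn_loss.png)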
Table 2: WER (U-WER/B-WER) on Librispeech-960 for different bias list sizes $N$.
Model | $N$ = 0 (no-bias), test-clean | $N$ = 0 (no-bias), test-other | $N$ = 100, test-clean | $N$ = 100, test-other | $N$ = 500, test-clean | $N$ = 500, test-other | $N$ = 1000, test-clean | $N$ = 1000, test-other |
---|---|---|---|---|---|---|---|---|
Baseline [7] | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) |
CPPNet [21] | 4.29 (2.6/18.3) | 9.16 (5.9/37.5) | 3.40 (2.6/10.4) | 7.77 (6.0/23.0) | 3.68 (2.8/10.9) | 8.31 (6.5/24.3) | 3.81 (2.9/11.4) | 8.75 (6.9/25.3) |
Proposed w/o BPB | 5.81 (4.8/13.7) | 9.17 (6.8/30.1) | 2.94 (2.5/6.5) | 6.21 (5.4/13.1) | 3.24 (2.7/7.9) | 6.56 (5.5/15.9) | 4.07 (3.4/9.7) | 7.60 (6.4/18.6) |
Proposed w/ BPB | 5.05 (3.9/14.1) | 8.81 (6.6/27.9) | 2.75 (2.3/6.0) | 5.60 (4.9/12.0) | 3.21 (2.7/7.0) | 6.28 (5.5/13.5) | 3.47 (3.0/7.7) | 7.34 (6.4/15.8) |
First, we verify the effect of the proposed techniques on the Librispeech-100 as a preliminary experiment. Table 1 shows the effect of the bias phrase index loss, described in Eq. (13), the special tokens for the bias phrases (sob/eob), and the BPB beam search on the Librispeech-100 test-clean evaluation set with a bias list size of $N$ = 100. Compared with the baseline (the hybrid CTC/attention model [7]), simply introducing the bias attention layer yields only a marginal gain (A1 vs. B1; B-WER 30.71 vs. 29.89), whereas the bias phrase index loss improves the B-WER significantly, which results in an improvement to the overall WER (B1 vs. B2). Figure 2 visualizes the bias phrase index probabilities described in Eq. (11). The bias phrase index probabilities are estimated correctly when the bias phrase index loss in Eq. (13) is introduced. In addition, introducing the special tokens (sob/eob) further improves the B-WER (B2 vs. B3). Furthermore, the BPB beam search technique significantly improves the B-WER with a slight degradation in U-WER (B3 vs. B4).
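The step-by-step gains in Table 1 are easier to compare as relative reductions. The short sketch below recomputes them from the table values; the helper name `relative_reduction` is hypothetical, and the numbers are copied from Table 1 rather than newly measured.

```python
def relative_reduction(before, after):
    """Relative error-rate reduction in percent (positive means improvement)."""
    return 100.0 * (before - after) / before


# (WER, U-WER, B-WER) values copied from Table 1.
table1 = {
    "A1": (8.59, 5.87, 30.71),
    "B1": (8.11, 5.43, 29.89),
    "B2": (7.53, 5.27, 25.92),
    "B3": (6.93, 4.96, 23.00),
    "B4": (5.92, 5.00, 17.93),
}

steps = [("A1", "B1"), ("B1", "B2"), ("B2", "B3"), ("B3", "B4"), ("A1", "B4")]
for prev, curr in steps:
    wer_gain = relative_reduction(table1[prev][0], table1[curr][0])
    bwer_gain = relative_reduction(table1[prev][2], table1[curr][2])
    print(f"{prev} -> {curr}: WER {wer_gain:.1f}%, B-WER {bwer_gain:.1f}%")
```

From the baseline (A1) to the full system (B4), this gives a relative B-WER reduction of roughly 42% and a relative WER reduction of roughly 31%.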
4.3 Main results
Table 2 shows the results obtained by the proposed method on the Librispeech-960 data for different bias list sizes $N$. The baseline is the hybrid CTC/attention model [7]. When the bias list size $N$ = 100, the proposed method improves the B-WER substantially (e.g., from 11.7 to 6.5 on test-clean without the BPB beam search), which in turn improves the overall WER; the U-WER also improves slightly. In addition, the proposed BPB beam search technique further improves the B-WER without degrading the overall WER and U-WER. The B-WER and U-WER tend to deteriorate as the number of bias phrases increases; however, the proposed BPB beam search technique is particularly effective in suppressing the deterioration of the B-WER. As a result, with the BPB beam search, the proposed method outperforms the baseline in terms of both WER and B-WER for all tested bias list sizes $N$ > 0. Although the proposed method underperforms the baseline when no bias phrases are used ($N$ = 0), we do not consider this a critical issue because users typically register the keywords that are important to them.
4.4 Analysis of the BPB beam search algorithm
Figure 3 shows the effect of the decoding weight of the BPB beam search on the Librispeech-960 test-other with a bias list size of $N$ = 100. The proposed method already improves the B-WER without the BPB beam search, as described in Section 4.3, and the BPB beam search technique improves it further. The B-WER, U-WER, and the overall WER are all best at a decoding weight of 1.0 and deteriorate away from this setting.
![Figure 3: Effect of the decoding weight of the BPB beam search.](extracted/5356396/onepass.png)
Figure 4 illustrates the inference results of three approaches: the baseline method, the proposed method without the BPB beam search technique, and the proposed method with the BPB beam search technique. Here, boldface denotes the bias phrases, and words in red and blue denote incorrectly and correctly recognized words, respectively. Even without the BPB beam search technique, the proposed method reduces the misrecognition of the bias phrases compared to the baseline; however, some bias phrases are not correctly recognized even when the correct bias phrase index is estimated. In contrast, with the BPB beam search technique, the bias phrases are recognized more accurately.
![Figure 4: Examples of inference results.](extracted/5356396/example.png)
4.5 Validation on Japanese dataset
We also validate the proposed method on Japanese data with the same experimental setup described in Section 4.1. The data comprise our in-house dataset containing 93 hours of Japanese speech, including meeting and morning assembly scenarios, the Corpus of Spontaneous Japanese (581 h) [30], and 181 hours of Japanese speech from the database developed by the Advanced Telecommunications Research Institute International [31]. Table 3 shows the evaluation results obtained on the in-house dataset when $N$ = 203 phrases, such as personal names and technical terms, are registered in the bias list. The proposed method significantly improves the B-CER with a slight degradation in the U-CER. Thus, the proposed method is effective for both English and Japanese.
Table 3: Results on the in-house Japanese dataset with $N$ = 203 bias phrases.
Model | CER | U-CER | B-CER |
---|---|---|---|
Baseline [7] | 9.85 | 8.17 | 22.32 |
Proposed ($N$ = 203) | 9.78 | 9.16 | 14.54 |
Proposed w/ BPB ($N$ = 203) | 9.67 | 9.20 | 13.16 |
5 Conclusion
This study introduces a deep biasing model that incorporates a bias phrase index loss and special tokens for bias phrases. Additionally, the BPB beam search technique leverages the bias phrase index probabilities to improve accuracy. Experimental results demonstrate that our model improves both the WER and the B-WER. Notably, the BPB beam search further improves the B-WER (B-CER for Japanese) without degrading the overall error rate, as observed on both the English and Japanese datasets.
References
- [1] Rohit Prabhavalkar, Takaaki Hori, Tara N Sainath, Ralf Schlüter, and Shinji Watanabe, “End-to-end speech recognition: A survey,” arXiv preprint arXiv:2303.03329, 2023.
- [2] Jinyu Li et al., “Recent advances in end-to-end automatic speech recognition,” APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1, 2022.
- [3] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
- [4] Alex Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICML, 2012.
- [5] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” Advances in neural information processing systems, vol. 28, 2015.
- [6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
- [7] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
- [8] Tara N Sainath et al., “Two-pass end-to-end speech recognition,” arXiv preprint arXiv:1908.10992, 2019.
- [9] Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, and Shinji Watanabe, “4D ASR: Joint modeling of CTC, attention, transducer, and mask-predict decoders,” in Proc. Interspeech, 2023, pp. 3312–3316.
- [10] Rongqing Huang, Ossama Abdel-Hamid, Xinwei Li, and Gunnar Evermann, “Class lm and word mapping for contextual biasing in end-to-end asr,” in Proc. Interspeech, 2020, pp. 4348–4351.
- [11] Ian Williams, Anjuli Kannan, Petar Aleksic, David Rybach, and Tara Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search,” in Proc. Interspeech, 2018.
- [12] Atsushi Kojima, “A study of biasing technical terms in medical speech recognition using weighted finite-state transducer,” Journal of the Acoustical Society of Japan, vol. 43, pp. 66–68, 2022.
- [13] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhifeng Chen, and Rohit Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. ICASSP, 2018, pp. 5824–5828.
- [14] Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates, “Cold fusion: training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391.
- [15] Xiaoqiang Wang et al., “Towards contextual spelling correction for customization of end-to-end speech recognition systems,” IEEE Trans. Audio, Speech, Lang. Process., vol. 30, pp. 3089–3097, 2022.
- [16] Yui Sudo, Kazuya Hata, and Kazuhiro Nakadai, “Retraining-free customized asr for enharmonic words based on a named-entity-aware model and phoneme similarity estimation,” in Proc. Interspeech, 2023, pp. 3312–3316.
- [17] Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. SLT, 2018, pp. 418–425.
- [18] Mahaveer Jain, Gil Keren, Jay Mahadeokar, and Yatharth Saraf, “Contextual rnn-t for open domain asr,” in Proc. Interspeech, 2020, pp. 11–15.
- [19] Antoine Bruguier, Rohit Prabhavalkar, Golan Pundak, and Tara N Sainath, “Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,” in Proc. ICASSP, 2019, pp. 6171–6175.
- [20] Saket Dingliwal, Monica Sunkara, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff, and Sravan Bodapati, “Personalization of ctc speech recognition models,” in Proc. SLT, 2023, pp. 302–309.
- [21] Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, and Lei Xie, “Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network,” in Proc. Interspeech, 2023, pp. 4933–4937.
- [22] Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, and Bo Xu, “Improving end-to-end contextual speech recognition with fine-grained contextual knowledge selection,” in Proc. ICASSP, 2022, pp. 491–495.
- [23] Christian Huber, Juan Hussain, Sebastian Stüker, and Alexander Waibel, “Instant one-shot word-learning for context-specific neural sequence-to-sequence speech recognition,” in Proc. ASRU, 2021, pp. 1–7.
- [24] Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, and Baoxing Huai, “Copyne: Better contextual asr by copying named entities,” arXiv preprint arXiv:2305.12839, 2023.
- [25] Anmol Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
- [26] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
- [27] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
- [28] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
- [29] Duc Le, Mahaveer Jain, et al., “Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion,” in Proc. Interspeech, 2021, pp. 1772–1776.
- [30] Kikuo Maekawa, “Corpus of spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003.
- [31] Akira Kurematsu et al., “Atr japanese speech database as a tool of speech recognition and synthesis,” Speech Communication, vol. 9, no. 4, pp. 357–363, 1990.