License: CC BY 4.0
arXiv:2401.10449v1 [eess.AS] 19 Jan 2024

Contextualized Automatic Speech Recognition
with Attention-Based Bias Phrase Boosted Beam Search

Abstract

End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or developer. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve the contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that, for the target phrases in the bias list, the proposed method consistently improves the word error rate on Librispeech-960 (English) and the character error rate on our in-house dataset (Japanese).

Index Terms—  speech recognition, attention, contextualization, biasing, beam search

1 Introduction

End-to-end (E2E) automatic speech recognition (ASR) [1, 2] methods directly convert acoustic feature sequences to token sequences without requiring the multiple components used in conventional ASR systems, such as acoustic models (AM) and language models (LM). Various E2E-ASR methods have been proposed previously, including connectionist temporal classification (CTC) [3], recurrent neural network transducer (RNN-T) [4], attention mechanism [5, 6], and their various hybrid systems [7, 8, 9]. Since the effectiveness of E2E-ASR methods is inherently related to the context in the training data, performance expectations may not be satisfied consistently for the given user context. For example, personal names and technical terms tend to be important keywords in different contexts, but such terms may not appear frequently in the available training data, which would result in poor recognition accuracy. It is impractical to cover all contexts during training; thus, the user or developer should be able to contextualize the model easily without retraining.

A typical approach to this problem is shallow fusion using an external LM [10, 11, 12, 13, 14]. For example, [10, 11, 12] used a weighted finite state transducer (WFST) to construct an in-class LM to facilitate contextualization for the target named entities. Neural LM fusion methods have also been proposed [13, 14]. The LM fusion technique attempts to enhance accuracy by combining an E2E-ASR model with an external neural LM and then rescoring the hypotheses generated by the E2E-ASR model. However, whether employing WFST or neural LMs, training an external LM requires additional training steps.

Thus, several methods have been proposed that do not require retraining. These methods include knowledge graph modeling [15] for recognizing out-of-vocabulary named entities, contextual spelling correction [15] using an editable phrase list, and named entity aware ASR models [16] that recognize specific named entities based on phoneme similarity. However, these methods have limitations, such as requiring a speech synthesis (TTS) model for training and not being able to handle words other than predefined target named entities.

Deep biasing methods [17, 18, 19, 20] provide an alternative approach to realize effective contextualization without requiring retraining processes and TTS models. In such methods, the E2E-ASR model can be contextualized using an editable phrase list, which is referred to as a bias list in this paper. Most deep biasing methods implement a cross-attention layer between the bias list and input sequences to recognize the bias phrases correctly. However, it has been observed that simply adding a cross-attention layer for the bias list is not effective [21]. Thus, [21, 22] introduced an additional branch designed to detect bias phrases, which indirectly helps to update the parameters of the cross-attention layer through an auxiliary loss. In contrast, [23, 24] introduced an auxiliary loss function directly on the cross-attention layer (referred to as the bias phrase index loss and described in Section 3.2), which detects the bias phrase index. While this approach allows for a direct parameter update of the cross-attention layer, it cannot distinguish whether the output tokens come from the bias list or not. In addition, [23] requires two-stage training using a pretrained ASR model, which is time-consuming.

This paper proposes a deep biasing method that employs both an auxiliary loss directly on the cross-attention layer, termed as bias phrase index loss, and special tokens for bias phrases to realize more effective bias phrase detection. Unlike conventional indirect methods [21, 22], our method facilitates the effective training of the cross-attention layer through the bias phrase index loss. Additionally, our technique departs from current methods [23] by introducing special tokens for bias phrases. This allows the model to focus on the bias phrases more effectively, eliminating the need for a two-stage training process. Furthermore, we propose a bias phrase boosted (BPB) beam search algorithm that integrates the bias phrase index probability during inference, augmenting the performance in bias phrase recognition. The main contributions of this study are as follows:

  • We propose a deep biasing model that utilizes both bias phrase index loss and special tokens for the bias phrases.

  • We propose a bias phrase boosted (BPB) beam search algorithm to further improve the performance for the target phrases.

  • We demonstrate that the proposed method is effective for both the Librispeech-960 and our in-house Japanese dataset.

2 Attention-based encoder-decoder ASR

This section describes an attention-based encoder-decoder system that consists of an audio encoder and an attention-based decoder, which are extended to the proposed method.

2.1 Audio encoder

The audio encoder comprises two convolutional layers, a linear projection layer, and $M_{\text{a}}$ Conformer blocks [25]. The audio encoder transforms an audio feature sequence $\bm{X}$ into a length-$T$ sequence of hidden state vectors $\bm{H}=[\bm{h}_{1},...,\bm{h}_{T}]\in\mathbb{R}^{T\times d}$, where $d$ represents the dimension, as follows:

$$\bm{H}=\mathrm{AudioEnc}(\bm{X}). \tag{1}$$

2.2 Attention-based decoder

The posterior probability is formulated as follows:

$$P_{\text{att}}(\bm{y}\mid\bm{X})=\prod_{s=1}^{S}P\left(y_{s}\mid\bm{y}_{0:s-1},\bm{X}\right), \tag{2}$$

where $s$ and $S$ represent the token index and the total number of tokens, respectively. Given $\bm{H}$ generated by the audio encoder in Eq. (1) and the previous token sequence $\bm{y}_{0:s-1}$, the attention-based decoder recursively estimates the next token $y_{s}$ as follows:

$$P(y_{s}\mid\bm{y}_{0:s-1},\bm{X})=\mathrm{AttnDec}(\bm{y}_{0:s-1},\bm{H}). \tag{3}$$

The attention-based decoder comprises an embedding layer with a positional encoding layer, $M_{\text{d}}$ Transformer blocks, and a linear layer. Each Transformer block has a multiheaded self-attention layer, a cross-attention layer (i.e., audio attention), and a linear layer with layer normalization (LN) layers and residual connections. Here, the audio attention layer including the LN is formulated as follows:

$$\bm{U}^{\prime}=\mathrm{Softmax}\left(\frac{\mathrm{LN}(\bm{U})\bm{H}^{T}}{\sqrt{d}}\right)\bm{H}+\bm{U}, \tag{4}$$

where $\bm{U}$ and $\bm{U}^{\prime}$ represent the input and output of the audio attention layer, respectively. In addition, the hybrid CTC/attention model [7] includes a CTC decoder. The attention-based decoder will be extended to the proposed bias decoder in Section 3.2.
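As a minimal sketch of Eq. (4), the following pure-Python example computes the attention output for a toy input. It follows the equation as written, i.e., a single head with no learned query/key/value projections and an LN without learned gain or bias; these are simplifications for illustration, and the function names (`layer_norm`, `softmax`, `cross_attention`) are hypothetical rather than taken from any toolkit.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(U, M):
    """Softmax(LN(U) M^T / sqrt(d)) M + U, applied row by row (single head)."""
    d = len(M[0])
    out = []
    for u in U:
        q = layer_norm(u)
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in M]
        w = softmax(scores)
        # Weighted sum of memory rows, followed by the residual connection.
        ctx = [sum(w[t] * M[t][j] for t in range(len(M))) for j in range(d)]
        out.append([c + r for c, r in zip(ctx, u)])
    return out

# Toy shapes: S = 2 decoder states, T = 3 encoder frames, d = 4.
U = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.1, 0.0, -0.3]]
H = [[0.5, 0.1, -0.2, 0.0], [0.0, 1.0, 0.3, -0.1], [0.9, -0.4, 0.2, 0.7]]
U_prime = cross_attention(U, H)
```

The output keeps the shape of the input ($S \times d$), which is what allows a second cross-attention layer of the same form to be stacked on top of it.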

Fig. 1: Overall architecture of the proposed method, including the audio encoder, bias encoder, and bias decoder. The BPB beam search algorithm is used during inference.

3 Proposed deep biasing method

Figure 1 shows the overall architecture of the proposed method, which comprises the audio encoder, bias encoder, and bias decoder. These components are described in the following subsections.

3.1 Bias encoder

The bias encoder comprises an embedding layer with a positional encoding layer, $M_{\text{e}}$ Transformer blocks, and a mean pooling layer. Its input is a bias list $\bm{B}=\{\bm{b}_{0},\bm{b}_{1},\cdots,\bm{b}_{N}\}$, where $n$ and $\bm{b}_{n}$ represent the bias phrase index and the token sequence of the $n$-th bias phrase (e.g., “play a song”), respectively. Here, $\bm{b}_{0}$ is a dummy phrase which means “no-bias”. After applying zero padding based on the max token length $L_{\text{max}}$ in the bias list $\bm{B}$, the embedding layer and the Transformer blocks extract a set of token-level feature sequences, $\bm{G}\in\mathbb{R}^{(N+1)\times L_{\text{max}}\times d}$, as follows:

$$\bm{G}=\mathrm{Transformer}(\mathrm{Embedding}(\bm{B})). \tag{5}$$

Then, mean pooling is performed to extract a phrase-level feature sequence, $\bm{V}=[\bm{v}_{0},\bm{v}_{1},\cdots,\bm{v}_{N}]\in\mathbb{R}^{(N+1)\times d}$, as follows:

$$\bm{V}=\mathrm{MeanPool}(\bm{G}). \tag{6}$$
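The shapes in Eqs. (5) and (6) can be illustrated with a small sketch. The embedding and Transformer blocks are replaced here by a toy feature function, the padding id and token ids are hypothetical, and the pooling skips padded positions; whether padding is masked during pooling is not stated in the text, so that choice is an assumption.

```python
PAD = 0  # hypothetical padding token id

def pad_bias_list(bias_list):
    """Zero-pad every phrase to the longest phrase length L_max in the list."""
    l_max = max(len(b) for b in bias_list)
    return [b + [PAD] * (l_max - len(b)) for b in bias_list]

def toy_features(padded):
    """Stand-in for Embedding + Transformer: maps each token id to a 2-dim vector."""
    return [[[float(t), 0.5 * float(t)] for t in phrase] for phrase in padded]

def mean_pool(G, padded):
    """Average token-level features per phrase, ignoring padded positions (assumption)."""
    V = []
    for feats, tokens in zip(G, padded):
        valid = [f for f, t in zip(feats, tokens) if t != PAD]
        d = len(feats[0])
        V.append([sum(f[j] for f in valid) / len(valid) for j in range(d)])
    return V

# b_0 is the "no-bias" dummy phrase; b_1 and b_2 are two bias phrases (token ids).
bias_list = [[1], [5, 6, 7], [8, 9]]
padded = pad_bias_list(bias_list)   # (N + 1) x L_max
G = toy_features(padded)            # (N + 1) x L_max x d
V = mean_pool(G, padded)            # (N + 1) x d
```

Each phrase, regardless of its token length, is thus reduced to a single $d$-dimensional vector that the decoder can attend to.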

3.2 Bias decoder

The bias decoder is an extension of the attention-based decoder described in Section 2.2, where an additional cross-attention layer (i.e., bias attention) is introduced to each Transformer block, as shown in Figure 1. Unlike Eq. (2), the posterior probability is formulated using the bias list $\bm{B}$ as follows:

$$P_{\text{batt}}(\bm{y}\mid\bm{X},\bm{B})=\prod_{s=1}^{S}P\left(y_{s}\mid\bm{y}_{0:s-1},\bm{X},\bm{B}\right). \tag{7}$$

Given $\bm{H}$ and $\bm{V}$ from Eqs. (1) and (6), and $\bm{y}_{0:s-1}$, the bias decoder estimates the next token $y_{s}$ recursively, unlike Eq. (3), as follows:

$$P\left(y_{s}\mid\bm{y}_{0:s-1},\bm{X},\bm{B}\right)=\mathrm{BiasDec}(\bm{y}_{0:s-1},\bm{H},\bm{V}). \tag{8}$$

In the Transformer block of the bias decoder, the bias attention layer including the LN is formulated as follows:

$$\bm{U}^{\prime\prime}=\mathrm{Softmax}\left(\frac{\mathrm{LN}(\bm{U}^{\prime})\bm{V}^{T}}{\sqrt{d}}\right)\bm{V}+\bm{U}^{\prime}. \tag{9}$$

In addition, the bias attention layer estimates the bias phrase index sequence $\bm{\hat{n}}=[\hat{n}_{1},\hat{n}_{2},\cdots,\hat{n}_{S}]$ as follows:

$$P_{\text{bidx}}(\bm{\hat{n}}\mid\bm{X},\bm{B})=\prod_{s=1}^{S}P\left(\hat{n}_{s}\mid\bm{y}_{0:s-1},\bm{X},\bm{B}\right), \tag{10}$$
$$P\left(\hat{n}_{s}\mid\bm{y}_{0:s-1},\bm{X},\bm{B}\right)=\mathrm{Softmax}\left(\frac{\mathrm{LN}(\bm{u}^{\prime}_{s})\bm{V}^{T}}{\sqrt{d}}\right), \tag{11}$$

where $\bm{u}^{\prime}_{s}$ denotes the $s$-th feature vector of $\bm{U}^{\prime}=[\bm{u}^{\prime}_{0},\bm{u}^{\prime}_{1},\cdots,\bm{u}^{\prime}_{S}]$. For example, if a bias phrase “play a song” with a bias index of 2 (Figure 1) is detected in a complete utterance “I play a song today”, the bias phrase index sequence is $\bm{\hat{n}}$ = [0, 2, 2, 2, 0]. Model parameters are optimized using the cross-entropy losses as follows:

$$L_{\text{batt}}=\mathrm{CrossEntropy}(\bm{y}_{\text{gt}},P_{\text{batt}}(\bm{y}\mid\bm{X},\bm{B})), \tag{12}$$
$$L_{\text{bidx}}=\mathrm{CrossEntropy}(\bm{\hat{n}}_{\text{gt}},P_{\text{bidx}}(\bm{\hat{n}}\mid\bm{X},\bm{B})), \tag{13}$$

where $\bm{y}_{\text{gt}}$ and $\bm{\hat{n}}_{\text{gt}}$ represent the one-hot vector sequences of the reference transcription and the reference bias phrase index (including the no-bias option), respectively. Here, we refer to $L_{\text{bidx}}$ as the bias phrase index loss.
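The reference index sequence in the example above can be built mechanically. The sketch below uses word-level tokens for readability (the model itself operates on its own token units), and the helper names are hypothetical. The cross-entropy is averaged over tokens, which is an assumption; the text does not state the reduction.

```python
import math

def bias_index_labels(tokens, bias_list):
    """Label each token with the index of the bias phrase covering it (0 = no-bias)."""
    labels = [0] * len(tokens)
    for n, phrase in enumerate(bias_list):
        if n == 0:  # index 0 is reserved for the no-bias dummy phrase
            continue
        length = len(phrase)
        for start in range(len(tokens) - length + 1):
            if tokens[start:start + length] == phrase:
                labels[start:start + length] = [n] * length
    return labels

def cross_entropy(labels, probs):
    """Mean negative log-likelihood of the labels under per-token distributions."""
    return -sum(math.log(p[n]) for n, p in zip(labels, probs)) / len(labels)

bias_list = [["<no-bias>"], ["good", "morning"], ["play", "a", "song"]]
tokens = "I play a song today".split()
labels = bias_index_labels(tokens, bias_list)

# A uniform distribution over the three entries stands in for the output of Eq. (11).
uniform = [[1.0 / 3.0] * 3 for _ in tokens]
loss = cross_entropy(labels, uniform)
```

Here `labels` reproduces the sequence [0, 2, 2, 2, 0] from the text, and `loss` plays the role of $L_{\text{bidx}}$ for one utterance.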

3.3 Training

During the training process, a bias list $\bm{B}$ is created randomly from the corresponding reference transcriptions for each batch. Specifically, 0 to $N_{\text{utt}}$ bias phrases of 2 to $L_{\text{max}}$ token lengths are extracted uniformly for each utterance, resulting in a total of $N$ bias phrases (at most $N_{\text{utt}}\times n_{\text{batch}}$, where $n_{\text{batch}}$ is the number of utterances in the batch). After the bias list $\bm{B}$ is extracted randomly, special tokens (<sob>/<eob>) are inserted before and after the extracted phrases in the reference transcription to distinguish whether the output tokens come from the bias list or not. The proposed method is optimized via multitask learning using the weighted sum of the losses in Eqs. (12) and (13) and the CTC loss ($L_{\text{ctc}}$):

$$L=\lambda_{\text{ctc}}L_{\text{ctc}}+\lambda_{\text{batt}}L_{\text{batt}}+\lambda_{\text{bidx}}L_{\text{bidx}}, \tag{14}$$

where $\lambda_{\text{ctc}}$, $\lambda_{\text{batt}}$, and $\lambda_{\text{bidx}}$ represent the training weights.
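The training-time data preparation and the loss in Eq. (14) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: spans are sampled per utterance, overlapping candidates are simply discarded (the text does not say how overlaps are handled), word-level tokens are used for readability, and all function names are hypothetical. The default weights are the values reported in Section 4.1.

```python
import random

def sample_bias_spans(tokens, n_utt=2, l_max=10, rng=random):
    """Sample 0..n_utt non-overlapping spans of 2..l_max tokens from one reference."""
    spans = []
    max_len = min(l_max, len(tokens))
    if max_len < 2:
        return spans
    for _ in range(rng.randint(0, n_utt)):
        length = rng.randint(2, max_len)
        start = rng.randint(0, len(tokens) - length)
        # Keep the candidate only if it does not overlap an accepted span (assumption).
        if all(start + length <= s or e <= start for s, e in spans):
            spans.append((start, start + length))
    return sorted(spans)

def insert_special_tokens(tokens, spans):
    """Wrap every extracted span in <sob> ... <eob> inside the reference transcription."""
    out, pos = [], 0
    for s, e in spans:
        out += tokens[pos:s] + ["<sob>"] + tokens[s:e] + ["<eob>"]
        pos = e
    return out + tokens[pos:]

def total_loss(l_ctc, l_batt, l_bidx, w_ctc=0.3, w_batt=0.7, w_bidx=1.0):
    """Weighted sum of the CTC, bias-attention, and bias phrase index losses."""
    return w_ctc * l_ctc + w_batt * l_batt + w_bidx * l_bidx

tokens = "I play a song today".split()
spans = sample_bias_spans(tokens, rng=random.Random(0))
phrases = [tokens[s:e] for s, e in spans]        # this utterance's bias phrases
target = insert_special_tokens(tokens, spans)    # decoder target with special tokens
```

Collecting `phrases` over all utterances in a batch, together with the no-bias dummy at index 0, would give the batch-level bias list.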

[Algorithm 1: Bias phrase boosted (BPB) beam search]

3.4 BPB beam search algorithm

We also propose a bias phrase boosted (BPB) beam search algorithm that exploits the bias phrase probability as described in Algorithm 1. The bias decoder calculates the token probability $\bm{p}_{\text{new}}$ including the special tokens, <sob>/<eob>, using Eq. (8) (line 5). We then estimate the bias phrase index $\hat{n}_{s}$ using Eq. (11) and the argmax function (line 6). Here, the number of bias phrases $N$ in the bias list $\bm{B}$ can increase significantly during inference, which would reduce the peak value after applying the softmax function in Eq. (9). Thus, Eq. (9) is approximated using the top $k_{\text{score}}$ pruning as follows:

$$\bm{U}^{\prime\prime}=\mathrm{Softmax}\left(\mathrm{Top\_k_{score}}\left(\frac{\mathrm{LN}(\bm{U}^{\prime})\bm{V}^{T}}{\sqrt{d}}\right)\right)\bm{V}+\bm{U}^{\prime}. \tag{15}$$
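The effect of the pruning in Eq. (15) on the softmax peak can be seen in a small sketch. It assumes that pruned scores are excluded from the softmax entirely (equivalent to setting them to negative infinity), which is one plausible reading of the top-$k$ operator; the function name and the toy scores are hypothetical.

```python
import math

def topk_softmax(scores, k):
    """Softmax over the k largest scores; all other entries receive probability 0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - m) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

# One matching phrase (score 4.0) among 999 non-matching phrases (score 0.0).
scores = [4.0] + [0.0] * 999
peak_full = topk_softmax(scores, k=len(scores))[0]   # plain softmax, as in Eq. (9)
peak_pruned = topk_softmax(scores, k=50)[0]          # pruned softmax, as in Eq. (15)
```

With the full list, the many low-scoring phrases absorb most of the probability mass, so `peak_full` is small; after pruning to 50 entries, `peak_pruned` is roughly an order of magnitude larger.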

Then, if $\hat{n}_{s}$ = 0 (i.e., “no-bias”), the token probabilities for the special tokens $\bm{p}_{\text{new}}[\text{sob}]$ and $\bm{p}_{\text{new}}[\text{eob}]$ are penalized based on the weight $\alpha_{\text{pen}}$ (lines 8 and 9); otherwise, the corresponding token probabilities are increased according to the weight $\alpha_{\text{bonus}}$ (lines 11 to 13). For example, if the detected bias phrase is “play a song”, the token probabilities for “play”, “a”, and “song” are increased with $\alpha_{\text{bonus}}$. Based on the boosted probabilities $\bm{p}_{\text{new}}$, the top $k_{\text{beam}}$ pruning is performed as in the conventional beam search [7].
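The boosting step can be sketched for a single hypothesis and decoding step as follows. The exact update rule is defined in Algorithm 1; this sketch assumes additive adjustments in the log-probability domain and boosts only the tokens of the detected phrase, both of which are assumptions. The vocabulary, function name, and toy scores are hypothetical; the default weights are the values reported in Section 4.1.

```python
def bpb_boost(log_probs, n_hat, bias_list, vocab, alpha_bonus=1.0, alpha_pen=10.0):
    """Return a boosted copy of the token log-probabilities for one decoding step."""
    boosted = list(log_probs)
    if n_hat == 0:
        # "no-bias" detected: discourage opening or closing a bias phrase.
        for tok in ("<sob>", "<eob>"):
            boosted[vocab[tok]] -= alpha_pen
    else:
        # A bias phrase was detected: favor the tokens it contains.
        for tok in bias_list[n_hat]:
            boosted[vocab[tok]] += alpha_bonus
    return boosted

vocab = {"<sob>": 0, "<eob>": 1, "play": 2, "a": 3, "song": 4, "today": 5}
bias_list = [["<no-bias>"], ["good", "morning"], ["play", "a", "song"]]
log_probs = [-1.0] * len(vocab)

no_bias_step = bpb_boost(log_probs, 0, bias_list, vocab)   # penalizes <sob>/<eob>
bias_step = bpb_boost(log_probs, 2, bias_list, vocab)      # boosts "play", "a", "song"
```

The boosted scores would then be passed to the usual top-$k_{\text{beam}}$ pruning in place of the raw decoder scores.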

4 Experiment

4.1 Experimental setup

The input features are 80-dimensional Mel filterbanks with a window size of 512 samples and a hop length of 160 samples. Then, SpecAugment [26] is applied. The audio encoder has two convolutional layers with a stride of two for downsampling, a 256-dimensional linear projection layer, and 12 Conformer blocks with 1024 linear units. The bias encoder and the bias decoder have three Transformer blocks with 1024 linear units and six Transformer layers with 2048 units, respectively. The attention layers in the audio encoder, the bias encoder, and the bias decoder use four-head attention with a dimension $d$ of 256. During the training process, a bias list $\bm{B}$ is created randomly for each batch with $N_{\text{utt}}$ = 2 and $L_{\text{max}}$ = 10, as described in Section 3.3. In this experiment, the bias list $\bm{B}$ has a total of $N$ = 50 to 200 bias phrases within a batch. The training weights $\lambda_{\text{ctc}}$, $\lambda_{\text{batt}}$, and $\lambda_{\text{bidx}}$ (described in Eq. (14)) are set to 0.3, 0.7, and 1.0, respectively. The proposed model is trained for 150 epochs at a learning rate of 0.0015 with 15,000 warmup steps using the Adam optimizer.
During the decoding process, the hyperparameters $k_{\text{beam}}$, $k_{\text{score}}$, $\alpha_{\text{bonus}}$, and $\alpha_{\text{pen}}$ (Section 3.4) are set to 20, 50, 1.0, and 10.0, respectively.
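To relate the front-end settings above to the encoder sequence length $T$, the following arithmetic sketch converts them into time units. It assumes a 16 kHz sampling rate, which is not stated in this section, and it ignores the kernel size and padding of the convolutional layers, so the frame count is only approximate.

```python
SAMPLE_RATE = 16_000      # assumption: 16 kHz audio
WINDOW = 512              # window size in samples
HOP = 160                 # hop length in samples
SUBSAMPLING = 2 * 2       # two convolutional layers, each with a stride of two

window_ms = 1000 * WINDOW / SAMPLE_RATE     # analysis window duration
hop_ms = 1000 * HOP / SAMPLE_RATE           # time between input feature frames
encoder_frame_ms = hop_ms * SUBSAMPLING     # time between encoder output frames

def approx_encoder_frames(num_samples):
    """Approximate T for an utterance, ignoring convolution edge effects."""
    return num_samples // HOP // SUBSAMPLING

frames_10s = approx_encoder_frames(10 * SAMPLE_RATE)
```

Under these assumptions, each encoder output vector $\bm{h}_{t}$ covers a step of 40 ms, and a 10-second utterance yields about 250 encoder frames.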

The Librispeech corpus (960 h, 100 h) [27] is used to evaluate the proposed method using ESPnet as the E2E-ASR toolkit [28]. The proposed method is evaluated in terms of word error rate (WER), bias phrase WER (B-WER), and unbiased phrase WER (U-WER) [29]. Note that insertion errors are counted toward B-WER if the inserted phrases are present in the bias list; otherwise, insertion errors are counted toward the U-WER. The goal of the proposed method is to improve the B-WER with a slight degradation in the U-WER and overall WER.
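The three metrics can be sketched with a word-level Levenshtein alignment. This is a simplified illustration rather than the scoring script of [29]: a word counts as biased if it appears anywhere in the bias list (word-level membership instead of phrase-level matching), substitutions and deletions are attributed according to the reference word, and insertions according to the inserted word. All names are hypothetical.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_word, hyp_word)."""
    R, H = len(ref), len(hyp)
    dist = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dist[i][0] = i
    for j in range(H + 1):
        dist[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    ops, i, j = [], R, H
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def biased_wer(ref, hyp, bias_words):
    """Split word errors into biased (B-WER) and unbiased (U-WER) parts."""
    b_err = u_err = 0
    for op, r, h in align(ref, hyp):
        if op == "ok":
            continue
        word = h if op == "ins" else r   # insertions are judged by the inserted word
        if word in bias_words:
            b_err += 1
        else:
            u_err += 1
    n_b = sum(w in bias_words for w in ref)
    n_u = len(ref) - n_b
    return {
        "WER": (b_err + u_err) / len(ref),
        "B-WER": b_err / n_b if n_b else 0.0,
        "U-WER": u_err / n_u if n_u else 0.0,
    }

ref = "i play a song today".split()
hyp = "i pay a song today song".split()
scores = biased_wer(ref, hyp, bias_words={"play", "a", "song"})
```

In this toy case, the substitution of “play” and the inserted “song” both count toward the B-WER, while the U-WER stays at zero.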

4.2 Preliminary analysis of the proposed techniques

Table 1: Preliminary analysis on the Librispeech-100 test-clean.

| ID | Model | WER | U-WER | B-WER |
|---|---|---|---|---|
| A1 | Baseline [7] | 8.59 | 5.87 | 30.71 |
| B1 | Bias decoder | 8.11 | 5.43 | 29.89 |
| B2 | B1 + bias phrase index loss | 7.53 | 5.27 | 25.92 |
| B3 | B2 + <sob>/<eob> tokens | 6.93 | 4.96 | 23.00 |
| B4 | B3 + BPB beam search | 5.92 | 5.00 | 17.93 |

Fig. 2: Effect of the bias phrase index loss: (a) without bias phrase index loss; (b) with bias phrase index loss. The horizontal and vertical axes show token index $s$ and bias phrases in $\bm{B}$, respectively.

Table 2: Main WER results obtained on Librispeech-960 data, given as WER (U-WER/B-WER). Bold values indicate cases where the proposed method outperformed the baselines, and underlined values represent the best results.

| Model | $N$ = 0 (no-bias), test-clean | $N$ = 0 (no-bias), test-other | $N$ = 100, test-clean | $N$ = 100, test-other | $N$ = 500, test-clean | $N$ = 500, test-other | $N$ = 1000, test-clean | $N$ = 1000, test-other |
|---|---|---|---|---|---|---|---|---|
| Baseline [7] | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) | 3.56 (2.6/11.7) | 7.55 (5.6/24.8) |
| CPPNet [21] | 4.29 (2.6/18.3) | 9.16 (5.9/37.5) | 3.40 (2.6/10.4) | 7.77 (6.0/23.0) | 3.68 (2.8/10.9) | 8.31 (6.5/24.3) | 3.81 (2.9/11.4) | 8.75 (6.9/25.3) |
| Proposed w/o BPB | 5.81 (4.8/13.7) | 9.17 (6.8/30.1) | 2.94 (2.5/6.5) | 6.21 (5.4/13.1) | 3.24 (2.7/7.9) | 6.56 (5.5/15.9) | 4.07 (3.4/9.7) | 7.60 (6.4/18.6) |
| Proposed w/ BPB | 5.05 (3.9/14.1) | 8.81 (6.6/27.9) | 2.75 (2.3/6.0) | 5.60 (4.9/12.0) | 3.21 (2.7/7.0) | 6.28 (5.5/13.5) | 3.47 (3.0/7.7) | 7.34 (6.4/15.8) |

First, we verify the effect of the proposed techniques on the Librispeech-100 as a preliminary experiment. Table 1 shows the effect of the bias phrase index loss $L_{\text{bidx}}$ described in Eq. (13), the special tokens for the bias phrases (<sob>/<eob>), and the BPB beam search on the Librispeech-100 test-clean evaluation set with a bias list size of $N$ = 100. Compared with the baseline (the hybrid CTC/attention model [7]), simply introducing the bias attention layer yields only a marginal improvement (A1 vs. B1), whereas the bias phrase index loss improves the B-WER significantly, which results in an improvement in the overall WER (B1 vs. B2). Figure 2 visualizes the bias phrase index probabilities described in Eq. (11). The bias phrase index probabilities are estimated correctly by introducing the bias phrase index loss $L_{\text{bidx}}$ in Eq. (13). In addition, introducing the special tokens (<sob>/<eob>) further improves the B-WER (B2 vs. B3). Furthermore, the BPB beam search technique significantly improves the B-WER with a slight degradation in U-WER (B3 vs. B4).
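The step-by-step gains discussed above can be expressed as relative error reductions computed directly from the values in Table 1. The helper name below is hypothetical, and the sketch only restates the reported numbers in relative terms.

```python
# WER, U-WER, and B-WER values copied from Table 1 (Librispeech-100 test-clean, N = 100).
table1 = {
    "A1": {"WER": 8.59, "U-WER": 5.87, "B-WER": 30.71},
    "B1": {"WER": 8.11, "U-WER": 5.43, "B-WER": 29.89},
    "B2": {"WER": 7.53, "U-WER": 5.27, "B-WER": 25.92},
    "B3": {"WER": 6.93, "U-WER": 4.96, "B-WER": 23.00},
    "B4": {"WER": 5.92, "U-WER": 5.00, "B-WER": 17.93},
}

def relative_reduction(before, after):
    """Relative error reduction in percent (negative values mean degradation)."""
    return 100.0 * (before - after) / before

steps = [("A1", "B1"), ("B1", "B2"), ("B2", "B3"), ("B3", "B4"), ("A1", "B4")]
for src, dst in steps:
    gains = {
        metric: round(relative_reduction(table1[src][metric], table1[dst][metric]), 1)
        for metric in ("WER", "U-WER", "B-WER")
    }
    print(f"{src} -> {dst}: {gains}")
```

Over the full chain (A1 to B4), this gives a relative B-WER reduction of about 42% and a relative WER reduction of about 31%.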

4.3 Main results

Table 2 shows the results obtained by the proposed method on the Librispeech-960 data for different bias list sizes $N$. Baseline is the hybrid CTC/attention model [7]. When the bias list size $N$ = 100, the proposed method improves the B-WER, which in turn improves the U-WER and WER. In addition, the proposed BPB beam search technique further improves the B-WER without degrading the overall WER and U-WER. The B-WER and U-WER tend to deteriorate as the number of bias phrases $N$ increases; however, the proposed BPB beam search technique is particularly effective in terms of suppressing the deterioration of the B-WER. As a result, the proposed method with the BPB beam search outperforms the baseline in terms of both WER and B-WER whenever a bias list is used. Although the proposed method underperforms the baseline when no bias phrases are used ($N$ = 0), we do not consider this a critical issue because users usually register keywords that are important to them.

4.4 Analysis of the BPB beam search algorithm

Figure 3 shows the effect of the decoding weight $\alpha_{\text{bonus}}$ of the BPB beam search on the Librispeech-960 test-other with a bias list size of $N$ = 100. Although the proposed method improves the B-WER even without the BPB beam search technique, as described in Section 4.3, the BPB beam search technique improves it further. When the decoding weight $\alpha_{\text{bonus}}>1.5$, the B-WER, U-WER, and the overall WER deteriorate. The B-WER, U-WER, and the overall WER are the best at $\alpha_{\text{bonus}}$ = 1.0.

Fig. 3: Effect of the decoding weight $\alpha_{\text{bonus}}$ of the BPB beam search on Librispeech-960.
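The sensitivity to the decoding weight can be illustrated with a toy rescoring example. The sketch below is not the scoring rule of the BPB beam search; it assumes a generic additive bonus in which each candidate token's log-probability is increased by `alpha_bonus` times a hypothetical probability that a bias phrase containing that token is active. All names (`boosted_scores`, `best_token`) and all numbers are invented for illustration.

```python
import math
from typing import Dict


def boosted_scores(log_probs: Dict[str, float],
                   bias_prob: Dict[str, float],
                   alpha_bonus: float) -> Dict[str, float]:
    """Return candidate scores with an additive bias bonus.

    ASSUMPTION: bonus = alpha_bonus * bias_prob[token]. This is an
    illustrative rule, not the equation used by the BPB beam search.
    """
    return {tok: lp + alpha_bonus * bias_prob.get(tok, 0.0)
            for tok, lp in log_probs.items()}


def best_token(scores: Dict[str, float]) -> str:
    """Pick the highest-scoring candidate."""
    return max(scores, key=scores.get)


# Case 1 (made-up numbers): the bias phrase was spoken, but the decoder
# slightly prefers a similar-sounding non-bias token.
frame_bias = {"nakadai": math.log(0.3), "naka": math.log(0.6), "the": math.log(0.1)}
bias_prob_1 = {"nakadai": 0.9}

# Case 2 (made-up numbers): no bias phrase was spoken, but the bias
# probability is spuriously non-zero.
frame_plain = {"knock": math.log(0.8), "nakadai": math.log(0.1), "the": math.log(0.1)}
bias_prob_2 = {"nakadai": 0.5}

for alpha in (0.0, 1.0, 5.0):
    picks = (best_token(boosted_scores(frame_bias, bias_prob_1, alpha)),
             best_token(boosted_scores(frame_plain, bias_prob_2, alpha)))
    print(alpha, picks)
```

In this toy setting a moderate weight recovers the bias phrase in the first case without disturbing the second, whereas a very large weight forces the bias phrase into an utterance where it was not spoken. This mirrors only the qualitative trend of the error rates deteriorating for large weights; the specific thresholds are artifacts of the invented numbers.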

Figure 4 illustrates the inference results of three approaches: the baseline, the proposed method without the BPB beam search, and the proposed method with the BPB beam search. Here, boldface marks the bias phrases, and words in red and blue are incorrectly and correctly recognized words, respectively. Even without the BPB beam search, the proposed method reduces the misrecognition of the bias phrases compared with the baseline; however, some bias phrases are not recognized correctly even when the correct bias phrase index is estimated. In contrast, the proposed BPB beam search recognizes the bias phrases more accurately.

Fig. 4: Typical example. Boldface marks the bias phrases; red and blue mark incorrectly and correctly recognized words, respectively.

4.5 Validation on Japanese dataset

We also validate the proposed method on Japanese data comprising 93 hours of our in-house speech data (including meeting and morning assembly scenarios), the Corpus of Spontaneous Japanese (581 hours) [30], and 181 hours of Japanese speech from the database developed by the Advanced Telecommunications Research Institute International [31], with the same experimental setup as described in Section 4.1. Table 3 shows the evaluation results obtained on the in-house dataset when $N = 203$ phrases, such as personal names and technical terms, are registered in the bias list $\bm{B}$. With the BPB beam search, the proposed method significantly improves the B-CER (22.32 to 13.16) at the cost of a degradation in the U-CER (8.17 to 9.20), and the overall CER improves slightly (9.85 to 9.67). Thus, the proposed method is effective for both English and Japanese.

Table 3: Experimental results on our in-house Japanese dataset (CER, U-CER, and B-CER in %).
Model                        CER    U-CER   B-CER
Baseline [7]                 9.85   8.17    22.32
Proposed (N = 203)           9.78   9.16    14.54
Proposed w/ BPB (N = 203)    9.67   9.20    13.16
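The size of these changes is easier to judge in relative terms. The short sketch below computes the relative change of each metric for the proposed method with the BPB beam search against the baseline, using only the values in Table 3; the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(baseline: float, proposed: float) -> float:
    """Relative change in percent; negative values are error reductions."""
    return 100.0 * (proposed - baseline) / baseline


# (baseline, proposed w/ BPB) pairs copied from Table 3.
table3 = {
    "CER": (9.85, 9.67),
    "U-CER": (8.17, 9.20),
    "B-CER": (22.32, 13.16),
}

changes = {name: relative_change(b, p) for name, (b, p) in table3.items()}
for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
# CER: -1.8%
# U-CER: +12.6%
# B-CER: -41.0%
```

The B-CER reduction is therefore much larger in relative terms than the U-CER increase, and the two effects nearly offset each other in the overall CER.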

5 Conclusion

This paper proposed a deep biasing model that incorporates a bias phrase index loss and special tokens for bias phrases. In addition, we proposed the BPB beam search technique, which uses the bias phrase index probabilities to improve the recognition of bias phrases. Experimental results demonstrate that the proposed model improves both the WER and the B-WER. Notably, the BPB beam search further improves the B-WER (B-CER on the Japanese data) with minimal impact on the overall error rate on both the English and Japanese datasets.

References

  • [1] Rohit Prabhavalkar, Takaaki Hori, Tara N Sainath, Ralf Schlüter, and Shinji Watanabe, “End-to-end speech recognition: A survey,” arXiv preprint arXiv:2303.03329, 2023.
  • [2] Jinyu Li et al., “Recent advances in end-to-end automatic speech recognition,” APSIPA Transactions on Signal and Information Processing, vol. 11, no. 1.
  • [3] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006, pp. 369–376.
  • [4] Alex Graves, “Sequence transduction with recurrent neural networks,” in Proc. ICML, 2012.
  • [5] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” Advances in neural information processing systems, vol. 28, 2015.
  • [6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, 2016, pp. 4960–4964.
  • [7] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [8] Tara N Sainath et al., “Two-pass end-to-end speech recognition,” arXiv preprint arXiv:1908.10992, 2019.
  • [9] Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, and Shinji Watanabe, “4D ASR: Joint modeling of CTC, attention, transducer, and mask-predict decoders,” in Proc. Interspeech, 2023, pp. 3312–3316.
  • [10] Rongqing Huang, Ossama Abdel-Hamid, Xinwei Li, and Gunnar Evermann, “Class LM and word mapping for contextual biasing in end-to-end ASR,” in Proc. Interspeech, 2020, pp. 4348–4351.
  • [11] Ian Williams, Anjuli Kannan, Petar Aleksic, David Rybach, and Tara Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search,” in Proc. Interspeech, 2018.
  • [12] Atsushi Kojima, “A study of biasing technical terms in medical speech recognition using weighted finite-state transducer,” Journal of the Acoustical Society of Japan, vol. 43, pp. 66–68, 2022.
  • [13] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhifeng Chen, and Rohit Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. ICASSP, 2018, pp. 5824–5828.
  • [14] Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates, “Cold fusion: training seq2seq models together with language models,” in Proc. Interspeech, 2018, pp. 387–391.
  • [15] Xiaoqiang Wang et al., “Towards contextual spelling correction for customization of end-to-end speech recognition systems,” IEEE Trans. Audio, Speech, Lang. Process., vol. 30, pp. 3089–3097, 2022.
  • [16] Yui Sudo, Kazuya Hata, and Kazuhiro Nakadai, “Retraining-free customized ASR for enharmonic words based on a named-entity-aware model and phoneme similarity estimation,” in Proc. Interspeech, 2023, pp. 3312–3316.
  • [17] Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao, “Deep context: End-to-end contextual speech recognition,” in Proc. SLT, 2018, pp. 418–425.
  • [18] Mahaveer Jain, Gil Keren, Jay Mahadeokar, and Yatharth Saraf, “Contextual RNN-T for open domain ASR,” in Proc. Interspeech, 2020, pp. 11–15.
  • [19] Antoine Bruguier, Rohit Prabhavalkar, Golan Pundak, and Tara N Sainath, “Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,” in Proc. ICASSP, 2019, pp. 6171–6175.
  • [20] Saket Dingliwal, Monica Sunkara, Srikanth Ronanki, Jeff Farris, Katrin Kirchhoff, and Sravan Bodapati, “Personalization of CTC speech recognition models,” in Proc. SLT, 2023, pp. 302–309.
  • [21] Kaixun Huang, Ao Zhang, Zhanheng Yang, Pengcheng Guo, Bingshen Mu, Tianyi Xu, and Lei Xie, “Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network,” in Proc. Interspeech, 2023, pp. 4933–4937.
  • [22] Minglun Han, Linhao Dong, Zhenlin Liang, Meng Cai, Shiyu Zhou, Zejun Ma, and Bo Xu, “Improving end-to-end contextual speech recognition with fine-grained contextual knowledge selection,” in Proc. ICASSP, 2022, pp. 491–495.
  • [23] Christian Huber, Juan Hussain, Sebastian Stüker, and Alexander Waibel, “Instant one-shot word-learning for context-specific neural sequence-to-sequence speech recognition,” in Proc. ASRU, 2021, pp. 1–7.
  • [24] Shilin Zhou, Zhenghua Li, Yu Hong, Min Zhang, Zhefeng Wang, and Baoxing Huai, “CopyNE: Better contextual ASR by copying named entities,” arXiv preprint arXiv:2305.12839, 2023.
  • [25] Anmol Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020, pp. 5036–5040.
  • [26] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
  • [27] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [28] Shinji Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
  • [29] Duc Le, Mahaveer Jain, et al., “Contextualized streaming end-to-end speech recognition with trie-based deep biasing and shallow fusion,” in Proc. Interspeech, 2021, pp. 1772–1776.
  • [30] Kikuo Maekawa, “Corpus of spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003.
  • [31] Akira Kurematsu et al., “ATR Japanese speech database as a tool of speech recognition and synthesis,” Speech Communication, vol. 9, no. 4, pp. 357–363, 1990.