Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

Applying LLMs for rescoring N-best ASR hypotheses of casual conversations:
Effects of domain adaptation and context carry-over

Abstract

Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we examine this ability by performing $N$-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LLMs, and the CHiME-7 DASR task provides datasets of casual conversations between multiple participants. We investigate the effects of domain adaptation of the LLM and context carry-over when performing $N$-best rescoring. Experimental results show that, even without domain adaptation, Llama2 outperforms a standard-size domain-adapted Transformer-LM, especially when using a long context. Domain adaptation shortens the context length needed with Llama2 to achieve its best performance, i.e., it reduces the computational cost of Llama2.

keywords:
speech recognition, casual conversation, large language model, $N$-best rescoring, domain adaptation, context carry-over

1 Introduction

Large language models (LLMs), such as GPT-4 [1], PaLM2 [2], and Llama2 (Large Language Model Meta AI) [3], have become a prominent component of modern natural language processing (NLP) and are successfully utilized in various NLP tasks, such as machine translation, text summarization, and question answering. Recently, they have been used not only in NLP tasks but also in speech-related tasks, including automatic speech recognition (ASR). A simple way to utilize LLMs in ASR is to use them in the second-pass rescoring (re-ranking) of multiple ASR hypotheses represented as an $N$-best list or a lattice, which is obtained by the first-pass ASR decoding. Several studies have reported the usefulness of LLMs in $N$-best or lattice rescoring of ASR hypotheses [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].

Thanks to the significant progress of end-to-end (E2E) neural network modeling, the performance of ASR has greatly improved. Despite this significant progress, ASR accuracy remains unsatisfactory in some situations, such as performing ASR in daily-life environments [17, 18, 19, 20, 21, 22]. The distant ASR (DASR) task of the CHiME-7 challenge provides a dataset of such challenging situations [17]. The dataset contains casual conversations between multiple participants at real dinner parties. LMs can be expected to play an important role in ASR of such casual conversational speech, and most of the submitted systems try to use LMs during ASR decoding and/or for rescoring ASR hypotheses [19, 20, 21, 22]. However, the effect of using LMs is limited (the first-place system does not use any LMs [18]), and there is a demand for LMs to deal with such highly casual conversational speech.

As described above, several studies have successfully applied LLMs for rescoring ASR hypotheses [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. However, their targets are not casual conversations, and the ability of LLMs to rescore ASR hypotheses of casual conversations remains unclear (note that LLMs are not allowed to be used in the CHiME-7 challenge [17]). In this study, we examine this ability by performing $N$-best ASR hypotheses rescoring using Llama2-7B [3], which is one of the most representative Transformer [23] decoder-based causal LLMs, on the CHiME-7 DASR task. We comprehensively investigate the effects of domain adaptation of the LLM and context carry-over [9, 12, 13, 19] when performing $N$-best rescoring. We employ QLoRA [24] for memory-efficient domain adaptation and consider various context lengths (up to 1024 tokens) in context carry-over.

We conducted experiments, including experimental settings that have not been investigated in previous studies [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], and thus, the experimental results and findings obtained in this study are informative for researchers in this field (note that Llama2-7B is allowed to be used in the CHiME-8 challenge [25]). Our main findings can be summarized as follows.

  • Even without domain adaptation, Llama2 significantly outperforms a standard-size domain-adapted Transformer-LM.

  • Both domain adaptation and context carry-over improve the Llama2 performance.

  • Even without domain adaptation, by considering a very long context (e.g., 1024 tokens), Llama2 captures the flow of a conversation and reaches the lowest word error rate (WER) achieved with the domain-adapted Llama2.

  • Domain adaptation shortens the context length needed with Llama2 to achieve the lowest WER, significantly reducing the computational cost of Llama2.

2 Relation to prior work

Previous studies [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] use both Transformer encoder-based bidirectional LLMs, such as BERT [26], RoBERTa [27], and ELECTRA [28], and Transformer decoder-based unidirectional LLMs, such as GPT [29], GPT-2 [30], PaLM [31] and Llama1 [32], but focus more on the former encoder-based LLMs. In contrast, in this study, we focus on a decoder-based LLM, i.e., Llama2 [3], since recently released LLMs are mainly decoder-based, e.g., GPT-4 [1], PaLM2 [2], and Llama2, and we can expect their further progress.

Some previous studies [5, 7, 9, 11, 12, 14] use moderately conversational datasets, such as Switchboard (telephone conversations) [33], AMI (meeting conversations) [34], and an in-house dataset (conversations with a conversational agent) [11, 14]. In contrast, in this study, we use the CHiME-7 DASR task dataset (conversations at dinner parties) [17], which is much more casual and challenging than the above datasets, to reveal the applicability of LLMs for rescoring ASR hypotheses of highly casual conversations.

Considering past and future contexts is useful for rescoring current ASR hypotheses, and some previous studies perform context carry-over [9, 12, 13]. The past context is used with both encoder-based bidirectional LLMs and decoder-based unidirectional LLMs, while the future context is used only with encoder-based LLMs. In this study, we utilize only the past context since we use Llama2, but we comprehensively investigate the effect of the context length by varying it in a wide range, i.e., 0 (without considering the context) to 1024 tokens. The context length investigated in this study is much longer than that investigated in the previous studies, i.e., up to 180 tokens [9].

3 Models and methods

We introduce the LMs used in this study, the domain adaptation methods of the LMs, the $N$-best rescoring method with context carry-over, and text preprocessing.

3.1 Language models

We use Llama2-7B [3] as the main LLM. As a competitor, we also prepared a standard-size Transformer-LM. We used the Llama2 tokenizer (its vocabulary size is 32k BPE [35, 36] tokens) for the standard-size Transformer-LM as well, and thus, we can fairly compare these two models in terms of perplexity (PPL). To build the standard-size Transformer-LM, we first copied the configuration of Llama2-7B and edited it to define a downsized model structure, and then we trained the configured model from scratch using a text dataset. The model size (number of model parameters) is about 70M, i.e., 1/100 of the Llama2-7B size, which is the standard size of a Transformer-LM. This model inherits the configuration of Llama2-7B, and thus, in this study, we refer to it as Slama2-70M, i.e., a Standard-size (or Smaller-size) Llama2. Details of Slama2-70M are described in Section 4.1.
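The stated model sizes can be cross-checked with a rough parameter count. The sketch below is illustrative rather than the procedure used in this study: it takes the layer counts and dimensions from Table 2 and assumes a Llama-style decoder with untied input/output embeddings, no bias terms, standard multi-head attention, a gated feed-forward block with three projection matrices, and a vocabulary of exactly 32,000 tokens. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(vocab_size: int, hidden: int, layers: int, ffn: int) -> int:
    """Approximate parameter count of a Llama-style decoder.

    Assumptions (not stated in the paper): untied input/output embeddings,
    no bias terms, standard multi-head attention, gated feed-forward block.
    """
    embeddings = 2 * vocab_size * hidden   # input embedding + output projection
    attention = 4 * hidden * hidden        # query, key, value, output projections
    feed_forward = 3 * hidden * ffn        # gate, up, and down projections
    norms = 2 * hidden                     # two RMSNorm weight vectors per layer
    per_layer = attention + feed_forward + norms
    return embeddings + layers * per_layer + hidden  # "+ hidden" is the final norm


# Dimensions taken from Table 2; the vocabulary size of 32,000 is an assumption.
llama2_7b = llama_param_count(32_000, hidden=4096, layers=32, ffn=11008)
slama2_70m = llama_param_count(32_000, hidden=512, layers=8, ffn=2048)

print(f"Llama2-7B : {llama2_7b / 1e9:.2f}B parameters")
print(f"Slama2-70M: {slama2_70m / 1e6:.1f}M parameters")
print(f"size ratio: about 1/{llama2_7b / slama2_70m:.0f}")
```

Under these assumptions the count for the smaller model comes out slightly below 70M, and the ratio to Llama2-7B is close to 1/100, which is consistent with the description above.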

We also use Llama2-7B-Chat, which is a fine-tuned version of Llama2-7B that is optimized for dialogue use cases [3], since it may be more suitable than the base Llama2-7B for rescoring ASR hypotheses of casual conversation. We investigate which model is more suitable for the target in Section 4.3.

3.2 Domain adaptation

Llama2 is trained using massive text datasets and is expected to have general linguistic knowledge. However, conversations contained in the CHiME-7 DASR task dataset are highly casual, and thus, transcriptions of such conversations may not be included in the Llama2 training text datasets (their details are not disclosed [3]). We employ QLoRA [24] to adapt Llama2 to the target casual conversational domain in a memory-efficient way. With QLoRA, the large number of LLM parameters are quantized to 4 bits and frozen, while a small number of low-rank adapters (LoRA) [37] are fine-tuned using a smaller-size target-domain text dataset. For domain adaptation of Slama2, we perform full-parameter fine-tuning. Details of domain adaptation are described in Section 4.1.

3.3 N-best rescoring with context carry-over

Let $\mathbf{X}_i$ be a feature vector sequence of the $i$th utterance in an input utterance sequence. As the first-pass ASR decoding, an E2E ASR model decodes $\mathbf{X}_i$ and outputs $N$-best ASR hypotheses (an $N$-best list) of the input utterance as $\{\mathbf{w}_i^r\}_{r=1}^{N}$, where $\mathbf{w}_i^r$ is the $r$th rank hypothesis (token sequence). The ASR model provides the score (log-probability) for each of the $N$-best hypotheses as $\{\log P_{\mathtt{asr}}(\mathbf{w}_i^r|\mathbf{X}_i)\}_{r=1}^{N}$.

Then, as the second-pass post-processing, we perform $N$-best rescoring. We first calculate the LM score (log-probability) for each of the $N$-best hypotheses as $\{\log P_{\mathtt{lm}}(\mathbf{w}_i^r)\}_{r=1}^{N}$ using an LM. Next, for each rank, i.e., $r = 1, \cdots, N$, we combine the ASR and LM scores as,

$$\log P(\mathbf{w}_i^r|\mathbf{X}_i) = \log P_{\mathtt{asr}}(\mathbf{w}_i^r|\mathbf{X}_i) + \alpha \log P_{\mathtt{lm}}(\mathbf{w}_i^r) + \gamma \lvert\mathbf{w}_i^r\rvert, \tag{1}$$

where $\alpha$ ($\alpha \geq 0$) is the language weight and $\gamma$ ($\gamma \geq 0$) is the reward that is given proportional to the length of $\mathbf{w}_i^r$. Lastly, we select the best (the highest score rank) hypothesis based on the combined score $\log P(\mathbf{w}_i^r|\mathbf{X}_i)$ in Eq. (1) as the final 1-best ASR hypothesis.
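The score combination in Eq. (1) can be sketched in a few lines. This is a minimal illustration and not the implementation used in this study: `rescore_nbest` and `toy_lm_score` are hypothetical names, the toy LM simply penalizes out-of-vocabulary tokens, and the default weights 0.4 and 0.5 are the values reported for Llama2 in Section 4.1.

```python
from typing import Callable, List, Sequence, Tuple


def rescore_nbest(
    hyps: Sequence[Sequence[str]],
    asr_scores: Sequence[float],
    lm_score: Callable[[Sequence[str]], float],
    alpha: float = 0.4,
    gamma: float = 0.5,
) -> Tuple[int, List[float]]:
    """Combine ASR and LM log-probabilities as in Eq. (1) and pick the best rank.

    Returns the index of the best hypothesis and all combined scores.
    """
    combined = [
        asr + alpha * lm_score(hyp) + gamma * len(hyp)  # length reward uses token count
        for hyp, asr in zip(hyps, asr_scores)
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


def toy_lm_score(tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for an LM: known tokens cost -1, unknown tokens -6."""
    vocab = {"i", "like", "pizza", "."}
    return sum(-1.0 if tok in vocab else -6.0 for tok in tokens)


hyps = [["i", "like", "pisa", "."], ["i", "like", "pizza", "."]]
asr_scores = [-3.0, -3.5]  # the ASR model alone prefers the first hypothesis
best, scores = rescore_nbest(hyps, asr_scores, toy_lm_score)
print(best, " ".join(hyps[best]))
```

In this toy example the ASR score favors the first hypothesis, but the LM term outweighs the difference, so the second hypothesis is selected after rescoring.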

In the above basic $N$-best rescoring procedure, we focus on the current hypotheses. However, considering the past hypotheses sequence as the context is effective for rescoring the current hypotheses, especially for the conversational speech case. In this study, as with some previous studies [9, 12, 13, 19], we perform context carry-over in $N$-best rescoring. To consider the context, we modify the LM score in Eq. (1) as,

$$\log P_{\mathtt{lm}}(\mathbf{w}_i^r) \rightarrow \log P_{\mathtt{lm}}(\mathbf{w}_i^r|\mathbf{w}_{-L:-1}^{\mathtt{best}}), \tag{2}$$

where $\mathbf{w}_{-L:-1}^{\mathtt{best}}$ is the best past context (token sequence) of length (number of tokens) $L$ obtained by $N$-best rescoring for the past $N$-best hypotheses sequence. Note that, in this study, we do not take the hypothesis (utterance) boundaries into account, i.e., the best past context can start from the middle of a past 1-best hypothesis. Note also that, as with $N$-best rescoring, we can perform PPL calculation with context carry-over. We comprehensively investigate the effect of the context length $L$ by varying it in a wide range in Section 4.2.
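Context carry-over as in Eq. (2) can be sketched as a loop that keeps a running history of the rescored 1-best tokens and conditions each new LM score on its last $L$ tokens. The sketch below is illustrative only: `rescore_with_context` and `toy_cond_lm_score` are hypothetical names, and the toy conditional LM merely makes tokens that already occurred in the context cheaper. In practice the conditional score would come from an LM such as Llama2.

```python
from typing import Callable, List, Sequence

Tokens = Sequence[str]


def rescore_with_context(
    nbest_lists: Sequence[Sequence[Tokens]],
    asr_scores: Sequence[Sequence[float]],
    cond_lm_score: Callable[[Tokens, Tokens], float],
    context_len: int,
    alpha: float = 0.4,
    gamma: float = 0.5,
) -> List[List[str]]:
    """Rescore a sequence of N-best lists, carrying over the past 1-best tokens."""
    history: List[str] = []       # past 1-best tokens, utterance boundaries ignored
    selected: List[List[str]] = []
    for hyps, scores in zip(nbest_lists, asr_scores):
        # Last L tokens of the history; guard L = 0 because history[-0:] is the full list.
        context = history[-context_len:] if context_len > 0 else []
        combined = [
            asr + alpha * cond_lm_score(hyp, context) + gamma * len(hyp)
            for hyp, asr in zip(hyps, scores)
        ]
        best = list(hyps[max(range(len(hyps)), key=combined.__getitem__)])
        selected.append(best)
        history.extend(best)      # the rescored 1-best becomes context for later utterances
    return selected


def toy_cond_lm_score(tokens: Tokens, context: Tokens) -> float:
    """Hypothetical conditional LM: tokens already seen in the context are cheaper."""
    seen = set(context)
    return sum(-1.0 if tok in seen else -2.0 for tok in tokens)


nbest_lists = [
    [["we", "ordered", "pizza", "."]],
    [["the", "pisa", "was", "good", "."], ["the", "pizza", "was", "good", "."]],
]
asr_scores = [[-1.0], [-2.0, -2.3]]

no_context = rescore_with_context(nbest_lists, asr_scores, toy_cond_lm_score, context_len=0)
with_context = rescore_with_context(nbest_lists, asr_scores, toy_cond_lm_score, context_len=8)
print(" ".join(no_context[1]))
print(" ".join(with_context[1]))
```

Because the history is a flat token list, a short context length cuts into the middle of a past hypothesis, which mirrors the boundary-agnostic treatment described above.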

3.4 Text processing

The authors of [19], who submitted the second-place system of the CHiME-7 challenge, ordered utterances (sentences) in the training text dataset by speaker, i.e., speaker 1’s utterance 1, utterance 2, …, speaker 2’s utterance 1, utterance 2, …, and trained an LM (they performed $N$-best rescoring by applying the same ordering to ASR hypotheses). This speaker-conditioned ordering is based on the assumption that utterances from one speaker have some consistency, and, within the speaker, the past utterances are useful in predicting the current utterance. However, this ordering ignores the flow of a conversation. We investigate which of the speaker-conditioned order or the conversational order is more suitable for the CHiME-7 DASR task in Section 4.3.
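The two orderings differ only in the sort key. The sketch below is a hedged illustration: it assumes each utterance carries a speaker label and a start time, which is how we interpret the two orders here, and the `Utterance` fields and function names are hypothetical.

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    speaker: str   # speaker label (hypothetical field)
    start: float   # start time in seconds (hypothetical field)
    text: str


def conversational_order(utts: List[Utterance]) -> List[Utterance]:
    """Order utterances by start time, following the flow of the conversation."""
    return sorted(utts, key=lambda u: u.start)


def speaker_conditioned_order(utts: List[Utterance]) -> List[Utterance]:
    """Group utterances by speaker, keeping time order within each speaker."""
    return sorted(utts, key=lambda u: (u.speaker, u.start))


utts = [
    Utterance("P02", 1.5, "not yet"),
    Utterance("P01", 0.0, "did you eat"),
    Utterance("P01", 3.0, "let us cook"),
    Utterance("P02", 4.2, "sounds good"),
]
print([u.text for u in conversational_order(utts)])
print([u.text for u in speaker_conditioned_order(utts)])
```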

Llama2 is trained using texts that preserve their original forms [32, 3], i.e., the texts preserve capitalized characters and symbols, such as commas, periods, (double) quotations, (semi-) colons, question/exclamation marks, and so on. In contrast, texts used in the ASR research field, including texts in the CHiME-7 DASR task dataset, are usually heavily normalized, i.e., all the characters in the texts are lowercased, and all the symbols are removed from the texts. It is not clear whether Llama2 can appropriately treat these heavily normalized texts. However, what we can do to recover the original texts is limited. In this study, we add a period to each sentence (or hypothesis in $N$-best rescoring). Another option is to capitalize the first character of each sentence (but it is difficult to recover other capitalization, e.g., that of named entities). We investigate whether this capitalization of the first character is effective for Llama2 in Section 4.3.
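The text processing described above amounts to a small string transformation. The following is a minimal sketch with a hypothetical helper `prepare_sentence`; it assumes the period is attached directly to the last word and that whitespace is collapsed first, neither of which is specified in the text.

```python
def prepare_sentence(text: str, capitalize_first: bool = False) -> str:
    """Append a period to a normalized sentence; optionally capitalize its first character."""
    sentence = " ".join(text.split())   # collapse stray whitespace (assumption)
    if not sentence:
        return sentence                 # leave empty hypotheses untouched
    if capitalize_first:
        sentence = sentence[0].upper() + sentence[1:]
    return sentence + "."               # period attached without a space (assumption)


print(prepare_sentence("i think so"))
print(prepare_sentence("i think so", capitalize_first=True))
```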

4 Experiments

We conducted $N$-best rescoring experiments using the CHiME-7 DASR task dataset [17] in the PyTorch [38] environment. We used ESPnet [39] for ASR model training and decoding. We also used Hugging Face Transformers [40] with the PEFT library [41] for LM training, domain adaptation, and inference.

4.1 Experimental settings

The CHiME-7 DASR task dataset [17] consists of three datasets, i.e., CHiME-6 [42], DiPCo [43], and Mixer 6 [44]. The former two datasets contain conversations between four participants at real dinner parties, while Mixer 6 contains conversations between an interviewer and a subject. CHiME-6 and Mixer 6 have training, development (dev), and evaluation (eval) data splits, while DiPCo has only dev and eval data splits. We used the CHiME-6 and Mixer 6 (CH6+Mx6) combined training dataset for LM domain adaptation, the CHiME-6 dev dataset for hyperparameter tuning, and all the dev and eval datasets for evaluation. Table 1 shows details of these datasets, and further details can be found in [17, 42, 43, 44]. As described in Section 3.4, we sorted all the sentences (utterances) in these datasets in the conversational order (not the speaker-conditioned order [19]) and added a period to each sentence (but we did not perform any capitalization).

For domain adaptation of Llama2, we attached LoRA adapters [37] to all the query and value projection matrices in the attention modules of Llama2 and fine-tuned them with QLoRA [24] (Section 3.2) using the CH6+Mx6 training dataset shown in Table 1. The ratio of the number of trainable parameters to that of all parameters was 0.06%. We set the context length (number of tokens) $L$ in Eq. (2) to 0, 16, 32, 64, 128, 256, 512, and 1024. For each of these context lengths $L$, we prepended the past $L$ tokens as the context to every sentence in the dataset and performed fine-tuning. We performed one-epoch QLoRA fine-tuning using the AdamW optimizer [45] by setting the LoRA rank, LoRA alpha scaling parameter, LoRA dropout probability, batch size, and learning rate to 8, 16, 0.05, 64, and 1e-5, respectively. As a result, we obtained eight domain-adapted Llama2 models.
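The reported ratio of trainable parameters can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check rather than output from the training code: it assumes that each adapted square projection matrix of size $d \times d$ receives two low-rank factors of sizes $d \times r$ and $r \times d$, and that Llama2-7B has about 6.74 billion parameters in total (an assumed figure, not one given in the paper).

```python
hidden_size = 4096        # Llama2-7B hidden size (Table 2)
num_layers = 32           # Llama2-7B hidden layers (Table 2)
rank = 8                  # LoRA rank used for domain adaptation
adapted_per_layer = 2     # query and value projection matrices

# Each adapted d x d matrix gets two low-rank factors: (d x r) and (r x d).
params_per_adapter = rank * (hidden_size + hidden_size)
trainable = num_layers * adapted_per_layer * params_per_adapter

total = 6_738_415_616     # assumed total parameter count of Llama2-7B (about 6.74B)
ratio_percent = 100.0 * trainable / total

print(f"trainable LoRA parameters: {trainable:,}")
print(f"ratio to all parameters  : {ratio_percent:.2f}%")
```

Under these assumptions, roughly 4.2M adapter parameters are trained, which agrees with the 0.06% figure above.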

Table 2 shows the configuration of Slama2-70M (Section 3.1) in comparison with that of Llama2-7B [3]. We trained Slama2 using 1.1G tokens of the LibriSpeech text dataset [46]. We concatenated all the sentences (token sequences) in the dataset to form one long token sequence and split it into token sequences of length 2048, which is the maximum positional embedding length of Slama2, as shown in Table 2. We trained Slama2 from scratch using these token sequences and then performed domain adaptation of it. For each of the eight context lengths $L$, we applied the same text processing described above to the CH6+Mx6 training dataset and performed fine-tuning of Slama2 using the dataset. We performed one-epoch full-parameter fine-tuning using the AdamW optimizer by setting the batch size and learning rate to 64 and 5e-6, respectively. As a result, we obtained eight domain-adapted Slama2 models.
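The preparation of the from-scratch training data (concatenate all token sequences, then split into fixed-length blocks) can be sketched as follows. This is an illustrative sketch with a hypothetical function name; keeping the shorter final block is an assumption, since the handling of the remainder is not described.

```python
from typing import Iterable, List


def pack_into_blocks(sentences: Iterable[List[int]], block_size: int = 2048) -> List[List[int]]:
    """Concatenate tokenized sentences into one stream and cut it into fixed-length blocks."""
    stream = [tok for sent in sentences for tok in sent]
    # The last block may be shorter than block_size; it is kept here (assumption).
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


# Tiny example with block_size=4 instead of 2048 so the result is easy to inspect.
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)
```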

As the E2E ASR model, we trained a competitive model based on a Conformer-encoder [47] and a structured state space (S4) decoder [48], which is used in the third-place system [20] of the CHiME-7 challenge. Using this ASR model, we performed ASR for all the dev and eval utterances and generated 32-best ASR hypotheses for each of the utterances. We did not use any LMs in ASR decoding. As with the above-described text processing, we sorted the ASR hypotheses in the conversational order and added a period to each hypothesis. Then, using each of Llama2 and the domain-adapted Slama2/Llama2 models for the eight context lengths $L$ (17 models in total), we performed rescoring for the 32-best ASR hypotheses. When using Llama2, we set the language weight $\alpha$ and the reward $\gamma$ in Eq. (1) to 0.4 and 0.5, respectively, and when using Slama2, we set them to 0.3 and 0.5, respectively. We optimized these values using the CHiME-6 dev dataset. We also performed token-based PPL evaluation for all the dev and eval transcriptions (correct token sequences).
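Token-based PPL can be computed from per-token log-probabilities. The sketch below is a generic illustration, not the evaluation script used in this study: it assumes natural-log probabilities and averaging over all tokens in the dataset, and, when context carry-over is used, it assumes that only the tokens of the evaluated sentence (not the context tokens) are passed in. The function name is hypothetical.

```python
import math
from typing import Sequence


def token_perplexity(sentence_logprobs: Sequence[Sequence[float]]) -> float:
    """Corpus-level token perplexity from natural-log token probabilities."""
    total_logprob = sum(sum(lps) for lps in sentence_logprobs)
    num_tokens = sum(len(lps) for lps in sentence_logprobs)
    return math.exp(-total_logprob / num_tokens)


# Sanity check: if every token has probability 0.25, the perplexity is 4.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
ppl = token_perplexity(uniform)
print(f"PPL = {ppl:.2f}")
```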

Table 1: Details of the CHiME-7 DASR task dataset. The numbers of words and tokens are counted using the manual transcriptions (correct sentences). However, we can obtain almost the same numbers with ASR hypotheses. # tokens per word $\simeq$ 1.5 for all the datasets. For example, in the case of the CHiME-6 dev dataset, the context length $L = 1024$ tokens corresponds to about 76 utterances (1024 / 13.4 $\simeq$ 76).
                    CH6+Mx6   CHiME-6             DiPCo               Mixer 6
                    Training  Dev       Eval      Dev       Eval      Dev       Eval
# utts (# sents)    120k      6.6k      18.2k     3.7k      3.4k      14.8k     5.1k
# words             994k      58.9k     101k      30.0k     28.8k     149k      69.3k
# tokens            1.48M     89.1k     164k      45.9k     43.2k     215k      96.1k
# words per utt     8.3       8.9       5.5       8.2       8.5       10.1      13.6
# tokens per utt    12.4      13.4      9.0       12.5      12.7      14.5      18.8
Table 2: Configurations of Llama2-7B and Slama2-70M.
                            Llama2-7B   Slama2-70M
Number of hidden layers     32          8
Hidden size                 4096        512
Number of attention heads   32          8
Intermediate (FFN) size     11008       2048
Max positional embeddings   4096        2048

4.2 Results of PPL evaluation and N-best rescoring

Table 3 shows the results of PPL evaluation and $N$-best rescoring. First, we can confirm that, in some cases, the domain-adapted Slama2 reduces the word error rates (WERs) from the strong ASR 1-best baseline. Longer contexts bring lower WERs (and PPLs). However, the reduction of the WERs is limited, as reported in the CHiME-7 papers [19, 20, 21, 22].

Next, we compare the results of Slama2 and Llama2 without domain adaptation. We can confirm that, with the shorter context lengths (especially when $L = 0$), Llama2 underperforms Slama2. However, its performance is quickly improved by considering longer contexts, i.e., by capturing the flow of a conversation. It achieves the lowest WERs by using a long context length, e.g., 512 and 1024.

Finally, we compare the results of Llama2 and the domain-adapted Llama2. We can confirm that, unfortunately, domain adaptation does not bring further WER reduction. However, it shortens the context length needed with Llama2 to achieve the lowest WERs. This is a large advantage since the computational cost of an LLM heavily depends on the length of an input token sequence, and by using shorter context lengths, we can greatly reduce the computational cost. For example, the inference time when $L = 128$ is about 1/10 of that when $L = 1024$. As reported in [12, 13], we also confirmed that recognition errors of infrequent words, such as “claustrophobic” and “octogenarians”, were reduced by using Llama2. Llama2 steadily reduces WERs from the strong ASR 1-best baseline, but there is still room for improvement since the lowest WERs obtained with Llama2 are much higher than those of the oracle hypotheses shown in the last row of Table 3.

Table 3: PPLs and $N$-best rescoring results in WERs obtained respectively with Llama2 and the domain-adapted Slama2/Llama2 of the eight context lengths $L$ (17 models in total) on the CHiME-7 DASR task dataset. WERs lower than the baseline ASR 1-best WERs are underlined, and the lowest WERs for each dataset are shown in bold font. If the WER reduction from the ASR 1-best WER is statistically significant at the 5% / 1% level, the WER is annotated with “∗” / “∗∗” [49]. DiPCo is not included in the domain adaptation dataset (Table 1). Thus, the WER reductions on the DiPCo datasets are smaller than those on the CHiME-6 and Mixer 6 datasets.
                         CHiME-6                         DiPCo                           Mixer 6
                         Dev             Eval            Dev             Eval            Dev             Eval
Model       Adapt  L     PPL     WER     PPL     WER     PPL     WER     PPL     WER     PPL     WER     PPL     WER
ASR 1-best  -      -     -       23.0    -       26.2    -       27.7    -       25.5    -       13.8    -       15.8
Slama2-70M  Full   0     48.3    22.8    48.3    26.2    48.4    27.8    45.6    25.8    46.3    14.0    45.4    15.9
Slama2-70M  Full   16    44.4    22.8    41.2    26.1    44.6    27.7    41.3    25.7    41.8    14.0    42.1    15.8
Slama2-70M  Full   32    41.9    22.8    38.3    26.0    42.7    27.7    39.4    25.6    39.9    14.0    40.3    15.8
Slama2-70M  Full   64    39.5    22.8    36.0    26.0    40.9    27.7    37.3    25.6    37.9    14.0    38.3    15.8
Slama2-70M  Full   128   37.6    22.8    34.2    26.0    39.5    27.7    35.7    25.6    36.2    13.9    36.6    15.8
Slama2-70M  Full   256   36.4    22.8    32.9    26.0    38.5    27.7    34.6    25.6    35.1    13.9    35.4    15.7
Slama2-70M  Full   512   35.7    22.7    32.2    25.9    38.0    27.6    34.1    25.6    34.4    13.9    34.7    15.7
Slama2-70M  Full   1024  35.5    22.7    32.0    25.9    37.9    27.7    34.0    25.5    34.2    13.9    34.4    15.8
Llama2-7B   -      0     57.6    22.9    102.0   26.1    66.2    28.3    57.2    26.5    52.3    14.5    38.6    16.2
Llama2-7B   -      16    29.5    22.6    32.6    25.7    32.6    27.9    27.5    26.0    25.3    14.2    22.5    15.9
Llama2-7B   -      32    22.9    22.5    22.9    25.7    25.0    27.8    21.5    26.0    19.0    14.1    18.0    15.8
Llama2-7B   -      64    19.0    22.5    18.8    25.5∗∗  20.4    27.8    17.3    25.8    15.4    13.9    15.0    15.7
Llama2-7B   -      128   16.8    22.4    16.5    25.4∗∗  17.8    27.5    15.0    25.6    13.5    13.7    13.2    15.6
Llama2-7B   -      256   15.4    22.3∗∗  15.1    25.4∗∗  16.3    27.5    13.8    25.5    12.5    13.6    12.1    15.5
Llama2-7B   -      512   14.6    22.2∗∗  14.1    25.3∗∗  15.5    27.3    13.1    25.4    11.9    13.6    11.4    15.5
Llama2-7B   -      1024  14.1    22.2∗∗  13.5    25.3∗∗  15.0    27.3    12.7    25.3    11.6    13.5    11.1    15.4
Llama2-7B   QLoRA  0     20.9    22.4    25.4    25.7    23.5    27.6    20.4    25.7    19.5    13.7    16.9    15.5
Llama2-7B   QLoRA  16    18.4    22.3∗∗  18.0    25.4∗∗  20.2    27.5    17.2    25.6    15.2    13.6    14.3    15.5
Llama2-7B   QLoRA  32    16.8    22.2∗∗  15.9    25.3∗∗  18.5    27.4    15.6    25.5    13.8    13.6    13.1    15.5
Llama2-7B   QLoRA  64    15.5    22.2∗∗  14.7    25.3∗∗  17.2    27.4    14.4    25.4    12.7    13.6    12.3    15.4
Llama2-7B   QLoRA  128   14.6    22.2∗∗  13.9    25.3∗∗  16.1    27.4    13.5    25.3    12.0    13.5    11.7    15.4
Llama2-7B   QLoRA  256   14.1    22.2∗∗  13.3    25.2∗∗  15.4    27.4    13.0    25.3    11.6    13.5    11.3    15.4
Llama2-7B   QLoRA  512   13.6    22.2∗∗  12.9    25.2∗∗  15.0    27.3    12.6    25.3    11.4    13.5    11.0    15.4
Llama2-7B   QLoRA  1024  13.4    22.2∗∗  12.6    25.2∗∗  14.7    27.3    12.4    25.3    11.3    13.5    10.8    15.4
Oracle      -      -     -       16.6    -       17.2    -       19.3    -       18.0    -       8.8     -       11.6

4.3 Comparison of experimental settings

As described in Sections 3.1 and 3.4, we performed PPL evaluation on the CHiME-6 dev dataset to compare experimental settings in terms of the following three aspects, i.e., (1) whether or not to capitalize the first character of each sentence, (2) whether to sort utterances in the conversational order or in the speaker-conditioned order [19], and (3) whether to use Llama2-7B or Llama2-7B-Chat [3].

Table 4 shows the experimental results. The leftmost setting is our current experimental setting described in Section 4.1. First, we can confirm that, by capitalizing the first character of each sentence, the PPLs get slightly higher. This result indicates either that capitalization is unnecessary thanks to the robust text processing ability of Llama2 or that we need a more sophisticated approach for recovering the original text forms.

Next, we can confirm that, with the shorter context lengths, the speaker-conditioned order shows lower PPLs than the conversational order, but with the longer context lengths, the trend is reversed. This result indicates that several consecutive utterances from one speaker have some consistency, while, in the longer contexts, the flow of a conversation becomes more dominant.

Finally, we can confirm that, by using Llama2-Chat, the PPLs get much higher. This result indicates that the style of the dialogue text datasets used to train Llama2-Chat may be very different from that of the CHiME-7 DASR task dataset. To summarize, our current setting described in Section 4.1 seems to be reasonable.

Table 4: Comparison results of the four experimental settings on the CHiME-6 dev dataset.
Llama2 version            7B      7B      7B      7B-Chat
Utterance order           Conv    Conv    Spkr    Conv
Capitalize the 1st char   No      Yes     No      No
Context length L = 0      57.6    69.1    57.6    86.6
                   16     29.5    31.6    28.4    41.2
                   32     22.9    24.0    22.7    32.3
                   64     19.0    19.9    19.4    26.3
                   128    16.8    17.7    17.6    22.4
                   256    15.4    16.2    16.4    20.0
                   512    14.6    15.0    15.6    18.5
                   1024   14.1    14.4    15.1    17.7

5 Conclusion and future work

We investigated the applicability of LLMs for rescoring ASR hypotheses of highly casual conversations by using Llama2 [3] and the CHiME-7 DASR task dataset [17]. Llama2 steadily reduces WERs from the strong ASR 1-best baseline, mainly owing to the effect of context carry-over. Domain adaptation reduces the computational cost of Llama2 by shortening the needed context length. The experimental results and findings obtained in this study are informative for researchers in this field. Future work will include using larger Llama2 models, i.e., 13B and 70B [3], and backward LMs [50, 51, 52].

References

  • [1] OpenAI, “GPT-4 technical report,” arXiv:2303.08774v5 [cs.CL].
  • [2] Google, “PaLM 2 technical report,” arXiv:2305.10403v3 [cs.CL].
  • [3] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv:2307.09288v2 [cs.CL].
  • [4] J. Shin, Y. Lee, and K. Jung, “Effective sentence scoring method using BERT for speech recognition,” in Proc. PMLR, 2019, pp. 1081–1093.
  • [5] K. Li et al., “An empirical study of Transformer-based neural language model adaptation,” in Proc. ICASSP, 2020, pp. 7934–7938.
  • [6] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked language model scoring,” in Proc. ACL, 2020, pp. 2699–2712.
  • [7] S.-H. Chiu and B. Chen, “Innovative BERT-based reranking language models for speech recognition,” in Proc. SLT, 2021, pp. 266–271.
  • [8] D. Fohr and I. Illina, “BERT-based semantic model for rescoring N-best speech recognition list,” in Proc. Interspeech, 2021, pp. 1867–1871.
  • [9] X. Zheng, C. Zhang, and P. C. Woodland, “Adapting GPT, GPT-2 and BERT language models for speech recognition,” in Proc. ASRU, 2021, pp. 162–168.
  • [10] H. Futami et al., “ASR rescoring and confidence estimation with ELECTRA,” in Proc. ASRU, 2021, pp. 380–387.
  • [11] L. Xu et al., “RescoreBERT: Discriminative speech recognition rescoring with BERT,” in Proc. ICASSP, 2022, pp. 6117–6121.
  • [12] T. Udagawa et al., “Effect and analysis of large-scale language model rescoring on competitive ASR systems,” in Proc. Interspeech, 2022, pp. 3919–3923.
  • [13] T. Chen et al., “Large-scale language model rescoring on long-form data,” in Proc. ICASSP, 2023.
  • [14] Y. Yu et al., “Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition,” in Proc. ASRU, 2023.
  • [15] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in Proc. ASRU, 2023.
  • [16] P. G. Shivakumar et al., “Discriminative speech recognition rescoring with pre-trained language models,” in Proc. ASRU, 2023.
  • [17] S. Cornell et al., “The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios,” in Proc. CHiME 2023, 2023.
  • [18] R. Wang et al., “The USTC-NERCSLIP systems for the CHiME-7 DASR challenge,” in Proc. CHiME 2023, 2023.
  • [19] L. Ye et al., “The IACAS-Thinkit system for CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
  • [20] N. Kamo et al., “NTT multi-speaker ASR system for the DASR task of CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
  • [21] T. Prisyach et al., “STCON system for the CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
  • [22] T. J. Park et al., “The CHiME-7 challenge: System description and performance of NeMo team’s DASR system,” in Proc. CHiME 2023, 2023.
  • [23] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
  • [24] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Proc. NeurIPS, 2023.
  • [25] J. Barker et al., “CHiME challenges and workshops,” https://www.chimechallenge.org/.
  • [26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
  • [27] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692v1 [cs.CL].
  • [28] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: Pre-training text encoders as discriminators rather than generators,” in Proc. ICLR, 2020.
  • [29] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI Technical Report, 2018.
  • [30] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI Technical Report, 2019.
  • [31] A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” arXiv:2204.02311v5 [cs.CL].
  • [32] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” arXiv:2302.13971v1 [cs.CL].
  • [33] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in Proc. ICASSP, 1992, pp. I–517–I–520.
  • [34] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in Proc. MLMI, 2006, pp. 28–39.
  • [35] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proc. ACL, 2016, pp. 1715–1725.
  • [36] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. EMNLP, 2018, pp. 66–71.
  • [37] E. Hu et al., “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2022.
  • [38] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. NeurIPS, 2019, pp. 8024–8035.
  • [39] S. Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
  • [40] T. Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proc. EMNLP, 2020, pp. 38–45.
  • [41] S. Mangrulkar et al., “PEFT: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
  • [42] S. Watanabe et al., “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. of The 6th Intl. Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020.
  • [43] M. V. Segbroeck et al., “DiPCo - Dinner party corpus,” arXiv:1909.13447v1 [eess.AS].
  • [44] L. Brandschain et al., “The Mixer 6 Corpus: Resources for cross-channel and text independent speaker recognition,” in Proc. LREC, 2010, pp. 2441–2444.
  • [45] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.
  • [46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [47] A. Gulati et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020.
  • [48] K. Miyazaki, M. Murata, and T. Koriyama, “Structured state space decoder for speech recognition and synthesis,” in Proc. ICASSP, 2023.
  • [49] S. Nakagawa and H. Takagi, “Statistical methods for comparing pattern recognition algorithms and comments on evaluating speech recognition performance,” Journal of the Acoustical Society of Japan, vol. 50, no. 10, pp. 849–854, October 1994.
  • [50] W. Xiong et al., “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec. 2017.
  • [51] K. Irie et al., “Investigation on estimation of sentence probability by combining forward, backward and bi-directional LSTM-RNNs,” in Proc. Interspeech, 2018, pp. 392–395.
  • [52] A. Ogawa, N. Tawara, M. Delcroix, and S. Araki, “Lattice rescoring based on large ensemble of complementary neural language models,” in Proc. ICASSP, 2022, pp. 6517–6521.