Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

Applying LLMs for rescoring N-best ASR hypotheses of casual conversations:
Effects of domain adaptation and context carry-over

Abstract

Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we examine this ability by performing $N$-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LLMs, and the CHiME-7 DASR task provides datasets of casual conversations between multiple participants. We investigate the effects of domain adaptation of the LLM and context carry-over when performing $N$-best rescoring. Experimental results show that, even without domain adaptation, Llama2 outperforms a standard-size domain-adapted Transformer-LM, especially when using a long context. Domain adaptation shortens the context length needed with Llama2 to achieve its best performance, i.e., it reduces the computational cost of Llama2.

Keywords: speech recognition, casual conversation, large language model, $N$-best rescoring, domain adaptation, context carry-over

1 Introduction

Large language models (LLMs), such as GPT-4 [1], PaLM2 [2], and Llama2 (Large Language Model META AI) [3], have now become a prominent component in modern natural language processing (NLP) and are successfully utilized in various NLP tasks, such as machine translation, text summarization, and question answering. Recently, they have been used not only in NLP tasks but also in speech-related tasks, including automatic speech recognition (ASR). A simple way to utilize LLMs in ASR is using them in the second-pass rescoring (re-ranking) of multiple ASR hypotheses represented as an $N$ -best list or a lattice, which is obtained by the first-pass ASR decoding. Several studies have reported the usefulness of LLMs in $N$ -best or lattice rescoring of ASR hypotheses [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].

Thanks to the significant progress of end-to-end (E2E) neural network modeling, the performance of ASR has greatly improved. Despite this significant progress, ASR accuracy remains unsatisfactory in some situations, such as performing ASR in daily-life environments [17, 18, 19, 20, 21, 22]. The distant ASR (DASR) task of the CHiME-7 challenge provides a dataset of such challenging situations [17]. The dataset contains casual conversations between multiple participants at real dinner parties. LMs can be expected to play an important role in ASR of such casual conversational speech, and most of the submitted systems try to use LMs during ASR decoding and/or for rescoring ASR hypotheses [19, 20, 21, 22]. However, the effect of using LMs is limited (the first-place system does not use any LMs [18]), and there is a demand for LMs to deal with such highly casual conversational speech.

As described above, several studies have successfully applied LLMs for rescoring ASR hypotheses [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. However, their targets are not casual conversations, and the ability of LLMs to rescore ASR hypotheses of casual conversations remains unclear (note that LLMs are not allowed to be used in the CHiME-7 challenge [17]). In this study, we examine this ability by performing $N$-best ASR hypotheses rescoring using Llama2-7B [3], which is one of the most representative Transformer [23] decoder-based causal LLMs, on the CHiME-7 DASR task. We comprehensively investigate the effects of domain adaptation of the LLM and context carry-over [9, 12, 13, 19] when performing $N$-best rescoring. We employ QLoRA [24] for memory-efficient domain adaptation and consider various context lengths (up to 1024 tokens) in context carry-over.

We conducted experiments, including experimental settings that have not been investigated in previous studies [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], and thus, the experimental results and findings obtained in this study are informative for researchers in this field (note that Llama2-7B is allowed to be used in the CHiME-8 challenge [25]). Our main findings can be summarized as follows.

• Even without domain adaptation, Llama2 significantly outperforms a standard-size domain-adapted Transformer-LM.
• Both domain adaptation and context carry-over improve the Llama2 performance.
• Even without domain adaptation, by considering a very long context (e.g., 1024 tokens), Llama2 captures the flow of a conversation and achieves the lowest word error rate (WER), which is at the same level as that achieved with the domain-adapted Llama2.
• Domain adaptation shortens the context length needed with Llama2 to achieve the lowest WER, significantly reducing the computational cost of Llama2.

2 Relation to prior work

Previous studies [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] use both Transformer encoder-based bidirectional LLMs, such as BERT [26], RoBERTa [27], and ELECTRA [28], and Transformer decoder-based unidirectional LLMs, such as GPT [29], GPT-2 [30], PaLM [31] and Llama1 [32], but focus more on the former encoder-based LLMs. In contrast, in this study, we focus on a decoder-based LLM, i.e., Llama2 [3], since recently released LLMs are mainly decoder-based, e.g., GPT-4 [1], PaLM2 [2], and Llama2, and we can expect their further progress.

Some previous studies [5, 7, 9, 11, 12, 14] use moderately conversational datasets, such as Switchboard (conversations on telephone calls) [33], AMI (conversations on meetings) [34], and an in-house dataset (conversations with a conversational agent) [11, 14]. In contrast, in this study, we use the CHiME-7 DASR task dataset (conversations at dinner parties) [17], which is much more casual and challenging than the above datasets, to reveal the applicability of LLMs for rescoring ASR hypotheses of highly casual conversations.

Considering past and future contexts is useful for rescoring current ASR hypotheses, and some previous studies perform context carry-over [9, 12, 13]. The past context is used with both encoder-based bidirectional LLMs and decoder-based unidirectional LLMs, while the future context is used only with encoder-based LLMs. In this study, we utilize only the past context since we use Llama2, but we comprehensively investigate the effect of the context length by varying it in a wide range, i.e., 0 (without considering the context) to 1024 tokens. The context length investigated in this study is much longer than that investigated in the previous studies, i.e., up to 180 tokens [9].

3 Models and methods

We introduce the LMs used in this study, the domain adaptation methods of the LMs, the $N$-best rescoring method with context carry-over, and text preprocessing.

3.1 Language models

We use Llama2-7B [3] as the main LLM. As a competitor, we also prepared a standard-size Transformer-LM. We used the Llama2 tokenizer (its vocabulary size is 32k BPE [35, 36] tokens) for the standard-size Transformer-LM as well, and thus, we can fairly compare these two models in terms of perplexity (PPL). To build the standard-size Transformer-LM, we first copied the configuration of Llama2-7B and edited it to define a downsized model structure, and then we trained the configured model from scratch using a text dataset. The model size (number of model parameters) is about 70M, i.e., 1/100 of the Llama2-7B size, which is the standard size of a Transformer-LM. This model inherits the configuration of Llama2-7B, and thus, in this study, we refer to it as Slama2-70M, i.e., Standard-size (or Smaller-size) of Llama2. Details of Slama2-70M are described in Section 4.1.

We also use Llama2-7B-Chat, which is a fine-tuned version of Llama2-7B that is optimized for dialogue use cases [3], since it may be more suitable than the base Llama2-7B for rescoring ASR hypotheses of casual conversation. We investigate which model is more suitable for the target in Section 4.3.

3.2 Domain adaptation

Llama2 is trained using massive text datasets and is expected to have general linguistic knowledge. However, conversations contained in the CHiME-7 DASR task dataset are highly casual, and thus, transcriptions of such conversations may not be included in the Llama2 training text datasets (their details are not disclosed [3]). We employ QLoRA [24] to adapt Llama2 to the target casual conversational domain in a memory-efficient way. With QLoRA, the large number of LLM parameters are quantized to 4 bits and frozen, while a small number of low-rank adapters (LoRA) [37] are fine-tuned using a smaller-size target-domain text dataset. As regards domain adaptation of Slama2, we perform full parameter fine-tuning. Details of domain adaptation are described in Section 4.1.

3.3 N-best rescoring with context carry-over

Let ${\mathbf{X}}_{i}$ be a feature vector sequence of the $i$ th utterance in an input utterance sequence. As the first-pass ASR decoding, an E2E ASR model decodes ${\mathbf{X}}_{i}$ and outputs $N$ -best ASR hypotheses (an $N$ -best list) of the input utterance as $\{{\mathbf{w}}_{i}^{r}\}_{r=1}^{N}$ , where ${\mathbf{w}}_{i}^{r}$ is the $r$ th rank hypothesis (token sequence). The ASR model provides the score (log-probability) for each of the $N$ -best hypotheses as $\{\log{P_{\mathtt{asr}}}({\mathbf{w}}_{i}^{r}|{\mathbf{X}}_{i})\}_{r=1}^{N}$ .

Then, as the second-pass post-processing, we perform $N$ -best rescoring. We first calculate the LM score (log-probability) for each of the $N$ -best hypotheses as $\{\log{P_{\mathtt{lm}}}({\mathbf{w}}_{i}^{r})\}_{r=1}^{N}$ using an LM. Next, for each rank, i.e., $r=1,{\cdots},N$ , we combine the ASR and LM scores as,

$$\log P({\mathbf{w}}_{i}^{r}|{\mathbf{X}}_{i})=\log P_{\mathtt{asr}}({\mathbf{w}}_{i}^{r}|{\mathbf{X}}_{i})+\alpha\log P_{\mathtt{lm}}({\mathbf{w}}_{i}^{r})+\gamma\lvert{\mathbf{w}}_{i}^{r}\rvert, \tag{1}$$

where $\alpha$ ( $\alpha\geq 0$ ) is the language weight and $\gamma$ ( $\gamma\geq 0$ ) is the reward that is given proportional to the length of ${\mathbf{w}}_{i}^{r}$ . Lastly, we select the best (the highest score rank) hypothesis based on the combined score ${\log}P({\mathbf{w}}_{i}^{r}|{\mathbf{X}}_{i})$ in Eq. (1) as the final 1-best ASR hypothesis.
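As a minimal sketch of Eq. (1), the following Python snippet combines the two scores with the length reward and selects the highest-scoring hypothesis. The `Hypothesis` record, the function names, and the toy scores are assumptions made for illustration only; whitespace-split words stand in for BPE tokens, and the default values of `alpha` and `gamma` are the ones reported for Llama2 in Section 4.1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    tokens: List[str]     # token sequence w_i^r (words stand in for BPE tokens here)
    asr_logprob: float    # log P_asr(w_i^r | X_i)
    lm_logprob: float     # log P_lm(w_i^r)


def combined_score(hyp: Hypothesis, alpha: float, gamma: float) -> float:
    """Eq. (1): ASR score + alpha * LM score + gamma * hypothesis length."""
    return hyp.asr_logprob + alpha * hyp.lm_logprob + gamma * len(hyp.tokens)


def rescore(nbest: List[Hypothesis], alpha: float = 0.4, gamma: float = 0.5) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: combined_score(h, alpha, gamma))


# Toy 3-best list with made-up scores (illustration only).
nbest = [
    Hypothesis("i scream for ice cream .".split(), asr_logprob=-4.0, lm_logprob=-30.0),
    Hypothesis("ice cream for ice cream .".split(), asr_logprob=-4.5, lm_logprob=-24.0),
    Hypothesis("i scream .".split(), asr_logprob=-12.0, lm_logprob=-12.0),
]
best = rescore(nbest)
print(" ".join(best.tokens))
```

In this toy list the ASR 1-best is the first entry, but the second entry wins after rescoring because its higher LM score outweighs its slightly lower ASR score.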

In the above basic $N$ -best rescoring procedure, we focus on the current hypotheses. However, considering the past hypotheses sequence as the context is effective for rescoring the current hypotheses, especially for the conversational speech case. In this study, as with some previous studies [9, 12, 13, 19], we perform context carry-over in $N$ -best rescoring. To consider the context, we modify the LM score in Eq. (1) as,

$$\log P_{\mathtt{lm}}({\mathbf{w}}_{i}^{r})\rightarrow\log P_{\mathtt{lm}}({\mathbf{w}}_{i}^{r}|{\mathbf{w}}_{-L:-1}^{{\mathtt{best}}}), \tag{2}$$

where ${\mathbf{w}}_{-L:-1}^{{\mathtt{best}}}$ is the best past context (token sequence) of the length (number of tokens) $L$ obtained by $N$-best rescoring for the past $N$-best hypotheses sequence. Note that, in this study, we ignore the hypothesis (utterance) boundaries, i.e., the best past context can start from the middle of a past 1-best hypothesis. Note also that, as with $N$-best rescoring, we can perform PPL calculation with context carry-over. We comprehensively investigate the effect of the context length $L$ by varying it in a wide range in Section 4.2.
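The carry-over procedure can be sketched as a loop over the utterance sequence that keeps a running history of the selected 1-best tokens and passes its last $L$ tokens to the LM. This is only an illustrative sketch: the `lm_logprob(context, hypothesis)` interface is hypothetical (in practice it would wrap a causal LM such as Llama2 and sum the log-probabilities of the hypothesis tokens only), and `toy_lm` is a made-up scorer used to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple

# One N-best entry: (token sequence, ASR log-probability).
NBestEntry = Tuple[List[str], float]
# Hypothetical LM interface: returns log P_lm(hypothesis | context).
LmScorer = Callable[[Sequence[str], Sequence[str]], float]


def rescore_with_context(
    nbest_lists: List[List[NBestEntry]],
    lm_logprob: LmScorer,
    context_len: int,
    alpha: float,
    gamma: float,
) -> List[List[str]]:
    """Rescore utterances in order, conditioning the LM on the last
    `context_len` tokens of the previously selected 1-best hypotheses."""
    history: List[str] = []      # concatenated past 1-best tokens
    results: List[List[str]] = []
    for nbest in nbest_lists:
        # Utterance boundaries are ignored: the context may start mid-hypothesis.
        context = history[-context_len:] if context_len > 0 else []
        best_tokens, _ = max(
            nbest,
            key=lambda e: e[1] + alpha * lm_logprob(context, e[0]) + gamma * len(e[0]),
        )
        results.append(best_tokens)
        history.extend(best_tokens)
    return results


def toy_lm(context: Sequence[str], hyp: Sequence[str]) -> float:
    """Made-up scorer: tokens already seen in the context are cheaper."""
    return sum(-0.1 if tok in context else -1.0 for tok in hyp)


nbest_lists = [
    [("we had pasta .".split(), -1.0)],
    [("the pastor was good .".split(), -1.0), ("the pasta was good .".split(), -1.2)],
]
no_context = rescore_with_context(nbest_lists, toy_lm, context_len=0, alpha=1.0, gamma=0.0)
with_context = rescore_with_context(nbest_lists, toy_lm, context_len=8, alpha=1.0, gamma=0.0)
print(" ".join(no_context[1]))
print(" ".join(with_context[1]))
```

With `context_len=0` the second utterance keeps the ASR 1-best ("pastor"), whereas with the preceding utterance as context the toy scorer prefers "pasta". Note that the context is built from the rescored 1-best hypotheses, so an early selection error can propagate to later utterances.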

3.4 Text processing

The authors of [19], who submitted the second-place system of the CHiME-7 challenge, ordered utterances (sentences) in the training text dataset as, speaker 1’s utterance 1, utterance 2, …, speaker 2’s utterance 1, utterance 2, …, and trained an LM (they performed $N$ -best rescoring by applying the same ordering to ASR hypotheses). This speaker-conditioned ordering is based on the assumption that utterances from one speaker have some consistency, and, within the speaker, the past utterances are useful in predicting the current utterance. However, this ordering ignores the flow of a conversation. We investigate which of the speaker-conditioned order or the conversational order is more suitable for the CHiME-7 DASR task in Section 4.3.
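The two orderings differ only in the sort key, as the short sketch below illustrates. The `Utterance` record with a speaker label and a start time is an assumed data layout, and sorting speakers by their ID string is an arbitrary choice made for this example.

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    speaker: str
    start_time: float   # seconds from the beginning of the session
    text: str


def conversational_order(utts: List[Utterance]) -> List[Utterance]:
    """Sort by start time, so speakers interleave as the conversation unfolds."""
    return sorted(utts, key=lambda u: u.start_time)


def speaker_conditioned_order(utts: List[Utterance]) -> List[Utterance]:
    """Group by speaker, keeping each speaker's utterances in time order."""
    return sorted(utts, key=lambda u: (u.speaker, u.start_time))


# Made-up session fragment (illustration only).
utts = [
    Utterance("P02", 3.0, "yeah"),
    Utterance("P01", 0.0, "do you want some"),
    Utterance("P01", 5.0, "okay"),
    Utterance("P02", 1.5, "what is it"),
]
conv = [u.text for u in conversational_order(utts)]
spkr = [u.text for u in speaker_conditioned_order(utts)]
print(conv)
print(spkr)
```

In the conversational order the question "do you want some" is directly followed by the reply "what is it", whereas the speaker-conditioned order separates them.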

Llama2 is trained using texts that preserve their original forms [32, 3], i.e., the texts preserve capitalized characters and symbols, such as commas, periods, (double) quotations, (semi-) colons, question/exclamation marks, and so on. In contrast, texts used in the ASR research field, including texts in the CHiME-7 DASR task dataset, are usually heavily normalized, i.e., all the characters in the texts are lowercased, and all the symbols are removed from the texts. It is not clear whether Llama2 can appropriately treat these heavily normalized texts. However, what we can do to recover the original texts is limited. In this study, we add a period for each sentence (or hypothesis in $N$ -best rescoring). What else we can do is capitalize the first character for each sentence (but it is difficult to recover other capitalization, e.g., named entities). We investigate whether this capitalization of the first character is effective for Llama2 in Section 4.3.
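A minimal sketch of this text processing is given below. The function name is hypothetical, and attaching the period directly to the last word (without a preceding space) is an assumption, since the exact formatting is not specified here.

```python
def format_sentence(text: str, capitalize_first: bool = False) -> str:
    """Append a period to a normalized sentence and optionally
    capitalize its first character."""
    text = text.strip()
    if not text:
        return text
    if capitalize_first:
        text = text[0].upper() + text[1:]
    # Assumption: the period is attached directly to the last word.
    return text + "."


print(format_sentence("i think so"))
print(format_sentence("i think so", capitalize_first=True))
```

Only the first character is touched; other capitalization, such as named entities, is left as it appears in the normalized text.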

4 Experiments

We conducted $N$ -best rescoring experiments using the CHiME-7 DASR task dataset [17] on the PyTorch [38] environment. We used ESPnet [39] for ASR model training and decoding. We also used Hugging Face Transformers [40] with the PEFT library [41] for LM training, domain adaptation, and inference.

4.1 Experimental settings

The CHiME-7 DASR task dataset [17] consists of three datasets, i.e., CHiME-6 [42], DiPCo [43], and Mixer 6 [44]. The former two datasets contain conversations between four participants at real dinner parties, while Mixer 6 contains conversations between an interviewer and a subject. CHiME-6 and Mixer 6 have the training, development (dev), and evaluation (eval) data splits, while DiPCo has the dev and eval data splits. We used the CHiME-6 and Mixer 6 (CH6+Mx6) combined training dataset for LM domain adaptation, the CHiME-6 dev dataset for hyperparameter tuning, and all the dev and eval datasets for evaluation. Table 1 shows details of these datasets, and further details can be found in [17, 42, 43, 44]. As described in Section 3.4, we sorted all the sentences (utterances) in these datasets in the conversational order (not the speaker-conditioned order [19]) and added a period for each sentence (but we did not perform any capitalization).

For domain adaptation of Llama2, we attached LoRA adapters [37] to all the query and value projection matrices in the attention modules of Llama2 and fine-tuned them with QLoRA [24] (Section 3.2) using the CH6+Mx6 training dataset shown in Table 1. The ratio of the number of trainable parameters against that of all parameters was 0.06%. We set the context length (number of tokens) $L$ in Eq. (2) at 0, 16, 32, 64, 128, 256, 512, and 1024, respectively. For each of these context lengths $L$ , we concatenated past $L$ tokens as the context to all the sentences in the dataset and performed fine-tuning. We performed one epoch QLoRA fine-tuning using the AdamW optimizer [45] by setting the LoRA rank, LoRA alpha scaling parameter, LoRA dropout probability, batch size, and learning rate at 8, 16, 0.05, 64, 1e-5, respectively. As a result, we obtained eight domain-adapted Llama2 models.
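The reported ratio of trainable parameters can be checked with a short calculation. The sketch below assumes that each adapted 4096 x 4096 projection matrix receives two rank-8 factors and that the base model has about 6.74B parameters (untied input/output embeddings, 32,000-token vocabulary); the constant names are chosen for illustration.

```python
# Llama2-7B dimensions and the LoRA rank used for domain adaptation.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
LORA_RANK = 8
ADAPTED_MATRICES_PER_LAYER = 2   # query and value projections

# Assumption: base-model parameter count with untied embeddings
# and a 32,000-token vocabulary.
BASE_PARAMS = 6_738_415_616

# Each adapted d x d matrix gets two low-rank factors: A (r x d) and B (d x r).
params_per_matrix = LORA_RANK * HIDDEN_SIZE + HIDDEN_SIZE * LORA_RANK
lora_params = NUM_LAYERS * ADAPTED_MATRICES_PER_LAYER * params_per_matrix
ratio = lora_params / (BASE_PARAMS + lora_params)

print(f"LoRA parameters: {lora_params:,}")
print(f"Trainable ratio: {100 * ratio:.2f}%")
```

Under these assumptions the adapters amount to roughly 4.2M parameters, which is consistent with the reported ratio of 0.06%.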

Table 2 shows the configuration of Slama2-70M (Section 3.1) in comparison with that of Llama2-7B [3]. We trained Slama2 using 1.1G tokens of the LibriSpeech text dataset [46]. We concatenated all the sentences (token sequences) in the dataset to form one long token sequence and split it into token sequences of length 2048, which is the maximum positional embedding length of Slama2, as shown in Table 2. We trained Slama2 from scratch using these token sequences and then performed domain adaptation of it. For each of the eight context lengths $L$ , we applied the same text processing described above to the CH6+Mx6 training dataset and performed fine-tuning of Slama2 using the dataset. We performed one epoch full parameter fine-tuning using the AdamW optimizer by setting the batch size and learning rate at 64 and 5e-6, respectively. As a result, we obtained eight domain-adapted Slama2 models.
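The concatenate-and-split step used to prepare the training sequences can be sketched as follows. The function name is hypothetical, and dropping a trailing remainder shorter than the block length is an assumption, since the handling of the last partial sequence is not stated.

```python
from typing import Iterable, List


def pack_into_blocks(sentences: Iterable[List[int]], block_size: int = 2048) -> List[List[int]]:
    """Concatenate tokenized sentences into one stream and split it
    into fixed-length blocks."""
    stream: List[int] = [tok for sent in sentences for tok in sent]
    # Assumption: a trailing remainder shorter than block_size is dropped.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Tiny example with made-up token IDs and a block size of 4.
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], block_size=4)
print(blocks)
```

Because sentences are packed back to back, a block can start or end in the middle of a sentence, which mirrors the way the context in Eq. (2) ignores utterance boundaries.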

As the E2E ASR model, we trained a competitive model based on a Conformer-encoder [47] and a structured state space (S4) decoder [48], which is used in the third-place system [20] of the CHiME-7 challenge. Using this ASR model, we performed ASR for all the dev and eval utterances and generated 32-best ASR hypotheses for each of the utterances. We did not use any LMs in ASR decoding. As with the above-described text processing, we sorted the ASR hypotheses in the conversational order and added a period for each hypothesis. Then, using Llama2, the domain-adapted Slama2/Llama2 of the eight context lengths $L$ (17 models in total), respectively, we performed rescoring for the 32-best ASR hypotheses. When using Llama2, we set the language weight $\alpha$ and the reward $\gamma$ in Eq. (1) at 0.4 and 0.5, respectively, and when using Slama2, we set them at 0.3 and 0.5, respectively. We optimized these values using the CHiME-6 dev dataset. We also performed token-based PPL evaluation for all the dev and eval transcriptions (correct token sequences).
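For reference, token-based PPL can be computed from per-token log-probabilities as in the sketch below. It assumes natural-log probabilities and that, when a context is used, only the tokens of the current sentence are counted while the context tokens are merely conditioned on; the function name is chosen for illustration.

```python
import math
from typing import List


def perplexity(token_logprobs: List[List[float]]) -> float:
    """Token-based PPL: exp of the negative mean log-probability per token.

    `token_logprobs[i][t]` is the natural-log probability that the LM assigns
    to the t-th token of the i-th sentence (given its preceding tokens and,
    optionally, the carried-over context).
    """
    total_logprob = sum(sum(sent) for sent in token_logprobs)
    total_tokens = sum(len(sent) for sent in token_logprobs)
    return math.exp(-total_logprob / total_tokens)


# Sanity check: if every token has probability 1/4, the PPL is 4.
uniform = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
print(perplexity(uniform))
```

Since Slama2 and Llama2 share the same tokenizer, the token counts in the denominator are identical for both models, which is what makes their PPLs directly comparable.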

Table 1: Details of the CHiME-7 DASR task dataset. The numbers of words and tokens are counted using the manual transcriptions (correct sentences). However, we can obtain almost the same numbers with ASR hypotheses. # tokens per word $\simeq$ 1.5 for all the datasets. For example, in the case of the CHiME-6 dev dataset, the context length $L = 1024$ tokens corresponds to about 76 utterances (1024 / 13.4 $\simeq$ 76).

	CH6+Mx6	CHiME-6	CHiME-6	DiPCo	DiPCo	Mixer 6	Mixer 6
	Training	Dev	Eval	Dev	Eval	Dev	Eval
# utts (# sents)	120k	6.6k	18.2k	3.7k	3.4k	14.8k	5.1k
# words	994k	58.9k	101k	30.0k	28.8k	149k	69.3k
# tokens	1.48M	89.1k	164k	45.9k	43.2k	215k	96.1k
# words per utt	8.3	8.9	5.5	8.2	8.5	10.1	13.6
# tokens per utt	12.4	13.4	9.0	12.5	12.7	14.5	18.8

Table 2: Configurations of Llama2-7B and Slama2-70M.

	Llama2-7B	Slama2-70M
Number of hidden layers	32	8
Hidden size	4096	512
Number of attention heads	32	8
Intermediate (FFN) size	11008	2048
Max positional embeddings	4096	2048
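The model sizes implied by Table 2 can be estimated with a short calculation. The sketch below assumes a Llama-style decoder with no bias terms, a gated feed-forward network with three projection matrices, two RMSNorm layers per block, untied input/output embeddings, as many key/value heads as attention heads, and the 32k-token vocabulary of the Llama2 tokenizer; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LlamaLikeConfig:
    num_layers: int
    hidden_size: int
    ffn_size: int
    vocab_size: int = 32_000   # Llama2 tokenizer


def count_parameters(cfg: LlamaLikeConfig) -> int:
    """Count the weights of a Llama-style decoder under the stated assumptions."""
    d, f = cfg.hidden_size, cfg.ffn_size
    attention = 4 * d * d                  # query, key, value, and output projections
    ffn = 3 * d * f                        # gate, up, and down projections
    norms = 2 * d                          # pre-attention and pre-FFN RMSNorm
    per_layer = attention + ffn + norms
    embeddings = 2 * cfg.vocab_size * d    # input embedding + output projection
    final_norm = d
    return cfg.num_layers * per_layer + embeddings + final_norm


llama2_7b = LlamaLikeConfig(num_layers=32, hidden_size=4096, ffn_size=11008)
slama2_70m = LlamaLikeConfig(num_layers=8, hidden_size=512, ffn_size=2048)

print(f"Llama2-7B : {count_parameters(llama2_7b) / 1e9:.2f}B")
print(f"Slama2-70M: {count_parameters(slama2_70m) / 1e6:.1f}M")
```

Under these assumptions the downsized configuration comes to roughly 66M parameters, i.e., about 1/100 of the 6.74B parameters of Llama2-7B, in line with the "about 70M" stated in Section 3.1. The number of attention heads does not affect the count because it only changes how the projection matrices are partitioned.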

4.2 Results of PPL evaluation and N-best rescoring

Table 3 shows the results of PPL evaluation and $N$-best rescoring. First, we can confirm that, in some cases, the domain-adapted Slama2 reduces the word error rates (WERs) from the strong ASR 1-best baseline. Longer contexts yield lower WERs (and PPLs). However, the reduction of the WERs is limited, as reported in the CHiME-7 papers [19, 20, 21, 22].

Next, we compare the results of Slama2 and Llama2 without domain adaptation. We can confirm that, with the shorter context lengths (especially when $L{=}0$ ), Llama2 underperforms Slama2. However, its performance is quickly improved by considering longer contexts, i.e., by capturing the flow of a conversation. It achieves the lowest WERs by using a long context length, e.g., 512 and 1024.

Finally, we compare the results of Llama2 and the domain-adapted Llama2. We can confirm that, unfortunately, domain adaptation does not bring further WER reduction. However, it shortens the context length needed with Llama2 to achieve the lowest WERs. This is a large advantage since the computational cost of an LLM heavily depends on the length of an input token sequence, and by using shorter context lengths, we can greatly reduce the computational cost. For example, the inference time when $L{=}128$ is about 1/10 of that when $L{=}1024$ . As reported in [12, 13], we also confirmed that recognition errors of infrequent words, such as “claustrophobic” and “octogenarians”, were reduced by using Llama2. Llama2 steadily reduces WERs from the strong ASR 1-best baseline, but there is still room for improvement since the lowest WERs obtained with Llama2 are much higher than those of the oracle hypotheses shown in the last row of Table 3.
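The oracle WER gives a lower bound on what any rescoring of a fixed $N$-best list can achieve. The sketch below assumes the usual definition, i.e., for each utterance the hypothesis with the smallest word-level edit distance to the reference is selected; the function names and the toy data are made up for illustration.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1]


def wer(refs: List[List[str]], hyps: List[List[str]]) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / sum(len(r) for r in refs)


def oracle_hypotheses(refs: List[List[str]], nbest_lists: List[List[List[str]]]) -> List[List[str]]:
    """For each utterance, pick the N-best entry closest to the reference."""
    return [min(nbest, key=lambda h: edit_distance(ref, h))
            for ref, nbest in zip(refs, nbest_lists)]


# Toy example: one utterance with a 3-best list (the first entry is the ASR 1-best).
refs = [["a", "b", "c"]]
nbest_lists = [[["a", "x", "c"], ["a", "b", "c", "d"], ["a", "b", "c"]]]
one_best = [nbest[0] for nbest in nbest_lists]
oracle = oracle_hypotheses(refs, nbest_lists)
print(wer(refs, one_best), wer(refs, oracle))
```

The gap between the rescored WER and the oracle WER indicates how much of the potential improvement contained in the $N$-best lists remains unexploited.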

Table 3: PPLs and $N$-best rescoring results in WERs obtained respectively with Llama2 and the domain-adapted Slama2/Llama2 of the eight context lengths $L$ (17 models in total) on the CHiME-7 DASR task dataset. WERs lower than the baseline ASR 1-best WERs are underlined, and the lowest WERs for each dataset are shown in bold font. If the WER reduction from the ASR 1-best WER is statistically significant at the 5% / 1% level, the WER is annotated with “$\ast$” / “${\ast}{\ast}$” [49]. DiPCo is not included in the domain adaptation dataset (Table 1). Thus, the WER reductions on the DiPCo datasets are smaller than those on the CHiME-6 and Mixer 6 datasets.

			CHiME-6				DiPCo				Mixer 6
			Dev		Eval		Dev		Eval		Dev		Eval
Model	Adapt	$L$	PPL	WER	PPL	WER	PPL	WER	PPL	WER	PPL	WER	PPL	WER
ASR 1-best	—	—	—	23.0	—	26.2	—	27.7	—	25.5	—	13.8	—	15.8
Slama2-70M	Full	0	48.3	22.8	48.3	26.2	48.4	27.8	45.6	25.8	46.3	14.0	45.4	15.9
		16	44.4	22.8	41.2	26.1	44.6	27.7	41.3	25.7	41.8	14.0	42.1	15.8
		32	41.9	22.8	38.3	26.0	42.7	27.7	39.4	25.6	39.9	14.0	40.3	15.8
		64	39.5	22.8	36.0	26.0	40.9	27.7	37.3	25.6	37.9	14.0	38.3	15.8
		128	37.6	22.8	34.2	26.0	39.5	27.7	35.7	25.6	36.2	13.9	36.6	15.8
		256	36.4	22.8	32.9	26.0	38.5	27.7	34.6	25.6	35.1	13.9	35.4	15.7
		512	35.7	22.7	32.2	25.9	38.0	27.6	34.1	25.6	34.4	13.9	34.7	15.7
		1024	35.5	22.7	32.0	25.9	37.9	27.7	34.0	25.5	34.2	13.9	34.4	15.8
Llama2-7B	—	0	57.6	22.9	102.0	26.1	66.2	28.3	57.2	26.5	52.3	14.5	38.6	16.2
		16	29.5	22.6	32.6	25.7^∗	32.6	27.9	27.5	26.0	25.3	14.2	22.5	15.9
		32	22.9	22.5^∗	22.9	25.7^∗	25.0	27.8	21.5	26.0	19.0	14.1	18.0	15.8
		64	19.0	22.5^∗	18.8	25.5^∗∗	20.4	27.8	17.3	25.8	15.4	13.9	15.0	15.7
		128	16.8	22.4^∗	16.5	25.4^∗∗	17.8	27.5	15.0	25.6	13.5	13.7	13.2	15.6
		256	15.4	22.3^∗∗	15.1	25.4^∗∗	16.3	27.5	13.8	25.5	12.5	13.6	12.1	15.5
		512	14.6	22.2^∗∗	14.1	25.3^∗∗	15.5	27.3	13.1	25.4	11.9	13.6	11.4	15.5
		1024	14.1	22.2^∗∗	13.5	25.3^∗∗	15.0	27.3	12.7	25.3	11.6	13.5^∗	11.1	15.4^∗
	QLoRA	0	20.9	22.4^∗	25.4	25.7^∗	23.5	27.6	20.4	25.7	19.5	13.7	16.9	15.5
		16	18.4	22.3^∗∗	18.0	25.4^∗∗	20.2	27.5	17.2	25.6	15.2	13.6	14.3	15.5
		32	16.8	22.2^∗∗	15.9	25.3^∗∗	18.5	27.4	15.6	25.5	13.8	13.6	13.1	15.5
		64	15.5	22.2^∗∗	14.7	25.3^∗∗	17.2	27.4	14.4	25.4	12.7	13.6	12.3	15.4^∗
		128	14.6	22.2^∗∗	13.9	25.3^∗∗	16.1	27.4	13.5	25.3	12.0	13.5^∗	11.7	15.4^∗
		256	14.1	22.2^∗∗	13.3	25.2^∗∗	15.4	27.4	13.0	25.3	11.6	13.5^∗	11.3	15.4^∗
		512	13.6	22.2^∗∗	12.9	25.2^∗∗	15.0	27.3	12.6	25.3	11.4	13.5^∗	11.0	15.4^∗
		1024	13.4	22.2^∗∗	12.6	25.2^∗∗	14.7	27.3	12.4	25.3	11.3	13.5^∗	10.8	15.4^∗
Oracle	—	—	—	16.6	—	17.2	—	19.3	—	18.0	—	8.8	—	11.6

4.3 Comparison of experimental settings

As described in Sections 3.1 and 3.4, we performed PPL evaluation on the CHiME-6 dev dataset to compare experimental settings with the following three aspects, i.e., (1) capitalize the first character of each sentence or not, (2) sort utterances in the conversational order or in the speaker-conditioned order [19], and (3) use Llama2-7B or Llama2-7B-Chat [3].

Table 4 shows the experimental results. The leftmost setting is our current experimental setting described in Section 4.1. First, we can confirm that capitalizing the first character of each sentence slightly increases the PPLs. This result indicates that either capitalization is unnecessary thanks to the robust text processing ability of Llama2, or we need a more sophisticated approach for recovering the original text forms.

Next, we can confirm that, with the shorter context lengths, the speaker-conditioned order yields lower PPLs than the conversational order, but with the longer context lengths, the trend is reversed. This result indicates that several consecutive utterances from one speaker have some consistency, while, in the longer contexts, the flow of a conversation becomes more dominant.

Finally, we can confirm that, by using Llama2-Chat, the PPLs get much higher. This result indicates that the style of the dialogue text datasets used to train Llama2-Chat may be very different from that of the CHiME-7 DASR task dataset. To summarize, our current setting described in Section 4.1 seems to be reasonable.

Table 4: Comparison results of the four experimental settings on the CHiME-6 dev dataset.

Llama2 version	7B	7B	7B	7B-Chat
Utterance order	Conv	Conv	Spkr	Conv
Capitalize the 1st char	No	Yes	No	No
Context length $L$ $=$ 0	57.6	69.1	57.6	86.6
16	29.5	31.6	28.4	41.2
32	22.9	24.0	22.7	32.3
64	19.0	19.9	19.4	26.3
128	16.8	17.7	17.6	22.4
256	15.4	16.2	16.4	20.0
512	14.6	15.0	15.6	18.5
1024	14.1	14.4	15.1	17.7

5 Conclusion and future work

We investigated the applicability of LLMs for rescoring ASR hypotheses of highly casual conversations by using Llama2 [3] and the CHiME-7 DASR task dataset [17]. Llama2 steadily reduces WERs from the strong ASR 1-best baseline, mainly owing to the effect of context carry-over. Domain adaptation reduces the computational cost of Llama2 by shortening the needed context length. The experimental results and findings obtained in this study are informative for researchers in this field. Future work will include using larger Llama2 models, i.e., 13B and 70B [3], and backward LMs [50, 51, 52].

References

[1] OpenAI, “GPT-4 technical report,” arXiv:2303.08774v5 [cs.CL].
[2] Google, “PaLM 2 technical report,” arXiv:2305.10403v3 [cs.CL].
[3] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv:2307.09288v2 [cs.CL].
[4] J. Shin, Y. Lee, and K. Jung, “Effective sentence scoring method using BERT for speech recognition,” in Proc. PMLR, 2019, pp. 1081–1093.
[5] K. Li et al., “An empirical study of Transformer-based neural language model adaptation,” in Proc. ICASSP, 2020, pp. 7934–7938.
[6] J. Salazar, D. Liang, T. Q. Nguyen, and K. Kirchhoff, “Masked language model scoring,” in Proc. ACL, 2020, pp. 2699–2712.
[7] S.-H. Chiu and B. Chen, “Innovative BERT-based reranking language models for speech recognition,” in Proc. SLT, 2021, pp. 266–271.
[8] D. Fohr and I. Illina, “BERT-based semantic model for rescoring N-best speech recognition list,” in Proc. Interspeech, 2021, pp. 1867–1871.
[9] X. Zheng, C. Zhang, and P. C. Woodland, “Adapting GPT, GPT-2 and BERT language models for speech recognition,” in Proc. ASRU, 2021, pp. 162–168.
[10] H. Futami et al., “ASR rescoring and confidence estimation with ELECTRA,” in Proc. ASRU, 2021, pp. 380–387.
[11] L. Xu et al., “RescoreBERT: Discriminative speech recognition rescoring with BERT,” in Proc. ICASSP, 2022, pp. 6117–6121.
[12] T. Udagawa et al., “Effect and analysis of large-scale language model rescoring on competitive ASR systems,” in Proc. Interspeech, 2022, pp. 3919–3923.
[13] T. Chen et al., “Large-scale language model rescoring on long-form data,” in Proc. ICASSP, 2023.
[14] Y. Yu et al., “Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition,” in Proc. ASRU, 2023.
[15] Y. Li, Y. Wu, J. Li, and S. Liu, “Prompting large language models for zero-shot domain adaptation in speech recognition,” in Proc. ASRU, 2023.
[16] P. G. Shivakumar et al., “Discriminative speech recognition rescoring with pre-trained language models,” in Proc. ASRU, 2023.
[17] S. Cornell et al., “The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios,” in Proc. CHiME 2023, 2023.
[18] R. Wang et al., “The USTC-NERCSLIP systems for the CHiME-7 DASR challenge,” in Proc. CHiME 2023, 2023.
[19] L. Ye et al., “The IACAS-Thinkit system for CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
[20] N. Kamo et al., “NTT multi-speaker ASR system for the DASR task of CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
[21] T. Prisyach et al., “STCON system for the CHiME-7 challenge,” in Proc. CHiME 2023, 2023.
[22] T. J. Park et al., “The CHiME-7 challenge: System description and performance of NeMo team’s DASR system,” in Proc. CHiME 2023, 2023.
[23] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
[24] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Proc. NeurIPS, 2023.
[25] J. Barker et al., “CHiME challenges and workshops,” https://www.chimechallenge.org/.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional Transformers for language understanding,” in Proc. NAACL-HLT, 2019, pp. 4171–4186.
[27] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692v1 [cs.CL].
[28] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: pre-training text encoders as discriminators rather than generators,” in Proc. ICLR, 2020.
[29] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” OpenAI Technical Report, 2018.
[30] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI Technical Report, 2019.
[31] A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” arXiv:2204.02311v5 [cs.CL].
[32] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” arXiv:2302.13971v1 [cs.CL].
[33] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in Proc. ICASSP, 1992, pp. I–517–I–520.
[34] J. Carletta et al., “The AMI meeting corpus: A pre-announcement,” in Proc. MLMI, 2006, pp. 28–39.
[35] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proc. ACL, 2016, pp. 1715–1725.
[36] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. EMNLP, 2018, pp. 66–71.
[37] E. Hu et al., “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2022.
[38] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. NeurIPS, 2019, pp. 8024–8035.
[39] S. Watanabe et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
[40] T. Wolf et al., “Transformers: State-of-the-art natural language processing,” in Proc. EMNLP, 2020, pp. 38–45.
[41] S. Mangrulkar et al., “PEFT: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
[42] S. Watanabe et al., “CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings,” in Proc. of The 6th Intl. Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020.
[43] M. V. Segbroeck et al., “DiPCo - Dinner party corpus,” arXiv:1909.13447v1 [eess.AS].
[44] L. Brandschain et al., “The Mixer 6 Corpus: Resources for cross-channel and text independent speaker recognition,” in Proc. LREC, 2010, pp. 2441–2444.
[45] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.
[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
[47] A. Gulati et al., “Conformer: Convolution-augmented Transformer for speech recognition,” in Proc. Interspeech, 2020.
[48] K. Miyazaki, M. Murata, and T. Koriyama, “Structured state space decoder for speech recognition and synthesis,” in Proc. ICASSP, 2023.
[49] S. Nakagawa and H. Takagi, “Statistical methods for comparing pattern recognition algorithms and comments on evaluating speech recognition performance,” Journal of the Acoustical Society of Japan, vol. 50, no. 10, pp. 849–854, October 1994.
[50] W. Xiong et al., “Toward human parity in conversational speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, Dec. 2017.
[51] K. Irie et al., “Investigation on estimation of sentence probability by combining forward, backward and bi-directional LSTM-RNNs,” in Proc. Interspeech, 2018, pp. 392–395.
[52] A. Ogawa, N. Tawara, M. Delcroix, and S. Araki, “Lattice rescoring based on large ensemble of complementary neural language models,” in Proc. ICASSP, 2022, pp. 6517–6521.
