License: CC BY 4.0
arXiv:2401.10446v1 [cs.CL] 19 Jan 2024

Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Yuchen Hu^{1,*,†}  Chen Chen^{1,*}  Chao-Han Huck Yang^{2,3,†}
Ruizhe Li^{4}  Chao Zhang^{5}  Pin-Yu Chen^{6}  Eng Siong Chng^{1}
^{1}Nanyang Technological University  ^{2}Georgia Institute of Technology  ^{3}NVIDIA Research
^{4}University of Aberdeen  ^{5}Tsinghua University  ^{6}MIT-IBM Watson AI Lab
*Equal contribution. †Corresponding authors: [email protected], [email protected]
Abstract

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the "HyPoradise" dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is to introduce noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of the source speech, which can promote the denoising process in GER. Furthermore, to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings into our language embedding. Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate even with limited training data. Analysis shows that our language-space noise embedding represents the noise conditions of the source speech well, under which off-the-shelf LLMs show a strong ability for language-space denoising. This work is open-sourced at: https://github.com/YUCHEN005/RobustGER.

1 Introduction

Recent advances in large language models (LLMs) have attracted a surge of research interest due to their representation power for language generation (OpenAI, 2022; 2023; Touvron et al., 2023a), achieving success on a wide range of natural language processing (NLP) tasks (Brown et al., 2020; Wei et al., 2022; Ouyang et al., 2022). Powered by LLMs, recent works (Chen et al., 2023b; Yang et al., 2023a) propose a generative error correction (GER) framework (https://github.com/Hypotheses-Paradise/Hypo2Trans) for automatic speech recognition (ASR), along with a "HyPoradise" dataset (https://huggingface.co/datasets/PeacefulData/Robust-HyPoradise) that contains abundant pairs of ASR N-best hypotheses and ground-truth transcription. It has shown great performance in learning the mapping from hypotheses to transcription by parameter-efficient LLM finetuning (Hu et al., 2021), significantly outperforming typical LM rescoring methods (Mikolov et al., 2010). However, their study lacks specificity on noisy ASR scenarios, which are the most common in the real world (Li et al., 2015).

In this work, we extend the GER benchmark to noisy conditions and propose a Robust HyPoradise (RobustHP) dataset with 113K hypotheses-transcription pairs from various ASR corpora in common noisy scenarios. Similar to the original benchmark, we also observe that LLM finetuning improves error correction on noisy ASR, but the performance gain in most noisy conditions is still limited (see Table 1). This indicates that LLM-based GER is still vulnerable to source audio noise (see our case study in Table 5). Fortunately, we draw inspiration from the noise-robust ASR community. Their key idea is to map noisy speech features to clean space (i.e., denoise) before recognition (Li et al., 2014), where speech enhancement denoising (Pandey et al., 2021) is one of the most popular approaches. Therefore, we raise a research question for our case: Can we teach LLMs to denoise the N-best hypotheses for GER, just as robust ASR and speech enhancement do?

Figure 1: Overview of (a) GER (Chen et al., 2023b; Yang et al., 2023a), (b) GER with audio-space denoising (Zhang et al., 2023b) (see details in §B.1), (c) GER with language-space denoising.

Inspired by recent works on LLM adaptation (Wu et al., 2023a; Fathullah et al., 2023; Gao et al., 2023), a general solution here is to incorporate audio noise information as a conditioner into LLM finetuning to make it noise-aware, which is also similar to the popular conditional diffusion model (Dhariwal & Nichol, 2021). However, recent works find that directly introducing other modalities (e.g., audio, visual) into LLM finetuning could harm its stability and performance due to the cross-modality gap (Zhang et al., 2023b; Li et al., 2023b). Our examination in Table 1 also indicates this limitation.

To this end, we propose to extract a noise embedding in language space to represent the noise conditions of the source speech, by measuring the diversity of the N-best hypotheses list from ASR decoding. The insight is that the worse the noisy conditions (more challenging noise type or lower SNR), the higher the uncertainty of ASR beam search decoding, which results in more diverse N-best hypotheses, as illustrated in Table 15 and Fig. 6. Extracted from the language space of hypotheses instead of the audio space, our noise embedding can be well incorporated into LLM tuning to improve GER, which can be viewed as a novel language-space denoising process. Furthermore, to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation (Belghazi et al., 2018) to distill the real noise information in audio embeddings into our extracted language embedding. As a result, it presents stronger noise representativeness (see Fig. 4(b)) and enhances the denoising performance. Various recent LLMs (e.g., LLaMA-2 (Touvron et al., 2023b), LLaMA (Touvron et al., 2023a) and Falcon (Penedo et al., 2023)) are used to verify the effectiveness of our approach, and the comprehensive experimental results demonstrate that our model improves the GER performance with up to 53.9% word error rate (WER) reduction on RobustHP test sets even with limited training data.

Our contributions can be summarized as follows:

  • We extend the latest ASR generative error correction benchmark to noise-robust ASR, where a Robust HyPoradise (RobustHP) dataset with 113K hypotheses-transcription pairs is collected from various ASR corpora in common noisy conditions.

  • We propose RobustGER, a noise-aware generative error correction approach based on LLMs to map N-best hypotheses to true transcription, where an extracted language-space noise embedding with audio distillation is utilized to teach LLMs to perform denoising.

  • Experiments on various latest LLMs show the proposed approach achieves a new breakthrough on RobustHP with up to 53.9% GER improvement in terms of word error rate (WER). Analysis verifies the effectiveness of our proposed language-space embedding to represent audio noise, under which LLMs show strong ability of language-space denoising.

2 Related Work

Large Language Models and Parameter-efficient Finetuning. There has recently been a surge of research interest in Transformer-based LLMs, such as ChatGPT (OpenAI, 2022), GPT-4 (OpenAI, 2023) and LLaMA (Touvron et al., 2023a). Benefiting from giant model size and abundant training data, LLMs can understand the linguistic structures and semantic meanings behind text, and show remarkable performance on a wide range of NLP tasks (Brown et al., 2020; Wei et al., 2022; Ouyang et al., 2022). To adapt LLMs to downstream tasks, many recent works investigate parameter-efficient LLM finetuning (Hu et al., 2021), given the huge size of these models. To further exploit the potential of LLMs on multimodal tasks, more recent works investigate incorporating other modalities into LLM tuning (Wu et al., 2023a; Fathullah et al., 2023; Li et al., 2023a; Chen et al., 2023c; Zhang et al., 2023a; b; Gao et al., 2023; Wang et al., 2023; Radhakrishnan et al., 2023). However, the latest works find that directly introducing other modalities into LLMs could harm the finetuning stability and performance due to the heterogeneous cross-modality gap (Zhang et al., 2023b; Li et al., 2023b). Therefore, this work proposes to extract a language embedding from the N-best list to represent audio noise, which works well in teaching LLMs to perform denoising.

LM Rescoring and ASR Generative Error Correction. LM rescoring has been widely used in ASR decoding to improve the linguistic acceptability of recognition results, which achieves stable gains in ASR performance (Arisoy et al., 2015; Shin et al., 2019; Mikolov et al., 2010; Yang et al., 2021; Yu et al., 2023). Typically, an external LM is deployed to rescore the N-best hypotheses list from ASR beam search decoding and select the 1-best candidate. Furthermore, to make full use of all candidates, recent works use the entire N-best list for error correction (Leng et al., 2021; Ma et al., 2023; Hu et al., 2020; 2023; Guo et al., 2019; Hu et al., 2022; Chen et al., 2023a), which outperforms rescoring methods. Powered by LLMs, the latest works propose a generative error correction (GER) benchmark (Chen et al., 2023b) to directly predict the ground-truth transcription from ASR N-best hypotheses. To enable the learning of the hypotheses-to-transcription mapping, they also propose a HyPoradise dataset with 316K hypotheses-transcription pairs. This work extends the GER benchmark to the most common noisy ASR scenarios with a new Robust HyPoradise dataset.

Noise-robust ASR. Neural ASR has achieved human-level performance, but its noise-robustness in the real world remains a challenge (Krishna et al., 2019). Recent noise-robust ASR methods make some progress by mapping noisy speech features to clean space (i.e., denoise) before recognition (Li et al., 2014). For instance, speech enhancement serves as a denoising front-end (Fu et al., 2019) to improve speech quality for ASR (Pandey et al., 2021), domain adversarial training aims to learn noise-invariant speech features (Prasad et al., 2021), and the recent ASR foundation model uses web-scale data and various preprocessing steps for denoising (Radford et al., 2023). Inspired by them, this work investigates teaching LLMs to denoise the N-best hypotheses in language space for GER.

3 Benchmark and Dataset

3.1 Generative Error Correction Benchmark

We extend the original generative error correction benchmark (Chen et al., 2023b) to noise-robust ASR. Given an input noisy speech $X_n$, the pre-trained ASR model first transcribes it into $N$-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$ by beam search decoding. The goal of GER is to learn a hypotheses-to-transcription (H2T) mapping $\mathcal{M}_{\text{H2T}}$ that predicts the transcription $Y$ based on the $N$-best list $\mathcal{Y}_N$:

$$Y = \mathcal{M}_{\text{H2T}}(\mathcal{Y}_N), \tag{1}$$

Given the ground-truth transcription $Y^*$, we can finetune the LLM to learn $\mathcal{M}_{\text{H2T}}$ in an auto-regressive manner, where the cross-entropy loss $\mathcal{L}_{\text{H2T}}$ is formulated as:

$$\mathcal{L}_{\text{H2T}} = \sum_{t=1}^{T} -\log \mathcal{P}_{\theta}(y_t^* \mid y_{t-1}^*, \cdots, y_1^*, \mathcal{Y}_N), \tag{2}$$

where $y_t^*$ is the $t$-th token of $Y^*$, and $\theta$ denotes the learnable parameters in the LLM (i.e., the adapter).
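As a minimal sketch of Eq. (2), the snippet below sums the negative log-probabilities of the gold tokens under teacher forcing. The function name `h2t_loss` and the toy inputs are illustrative assumptions; in practice the logits would come from the adapter-tuned LLM conditioned on the N-best list, which is not modeled here.

```python
import numpy as np

def h2t_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Sum over t of -log P(y_t* | y_<t*, Y_N), following Eq. (2).

    logits: (T, V) unnormalized scores; row t is assumed to be the model
            output given the N-best list and the gold prefix y_1*..y_{t-1}*.
    target_ids: (T,) vocabulary indices of the gold tokens y_t*.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of each gold token and sum the negatives.
    return float(-log_probs[np.arange(len(target_ids)), target_ids].sum())

# Toy check (hypothetical data): uniform logits over a 4-token vocabulary,
# so every one of the 3 steps costs log(4).
toy_logits = np.zeros((3, 4))
toy_targets = np.array([0, 2, 1])
toy_loss = h2t_loss(toy_logits, toy_targets)
```

Deep learning frameworks typically average this quantity over tokens rather than summing it; the sum is kept here only to mirror Eq. (2).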

3.2 Robust HyPoradise Dataset

Correspondingly, we develop a Robust HyPoradise dataset by collecting hypotheses-transcription (HT) pairs from common noisy ASR corpora, including CHiME-4 (Vincent et al., 2016), VoiceBank-DEMAND (Valentini-Botinhao et al., 2016), NOIZEUS (Hu & Loizou, 2006), LibriSpeech-FreeSound (Prasad et al., 2021) and RATS (Graff et al., 2014), with details provided in §A. We employ Whisper Large-V2 (Radford et al., 2023), the state-of-the-art ASR foundation model, to transcribe the noisy speech into N-best hypotheses (N is set to 5). As a result, we collect 113K HT pairs in total from various noise domains, and the dataset statistics are presented in Table 6.

4 Method

Figure 2: Left: The RobustGER framework that leverages efficient LLM finetuning to learn the mapping from ASR N-best hypotheses to ground-truth transcription, where we propose a language-space noise embedding with audio distillation to denoise the GER process. Right: The extraction of the language-space noise embedding from N-best hypotheses by measuring their diversity, where we calculate the utterance- and token-level embedding differences between each pair of hypotheses in the N-best list. The details of embedding extraction are illustrated in §4.2 and Eq. (4)-(6).

In this section, we present our noise-aware generative error correction (RobustGER) approach. We first describe the overall framework (§4.1), then introduce the extraction of the language-space noise embedding from N-best hypotheses (§4.2), and finally present audio noise distillation (§4.3).

4.1 Overall Framework

The left part of Fig. 2 presents the overall framework of RobustGER. First, the noisy speech $X_n$ is sent into a pre-trained ASR model to generate N-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$, where $N = 5$. Following that, we propose to extract a language-space noise embedding $E_{\text{LN}}$ from the N-best list $\mathcal{Y}_N$ to represent the noise conditions of the source speech $X_n$. As depicted in the right part of Fig. 2, this noise embedding measures the diversity of the N-best hypotheses on both utterance and token levels, which perceives the noise information in the input speech.

Furthermore, to enhance its noise representation ability, we design a KD approach to distill the real noise information in the source speech $X_n$ into the extracted language-space noise embedding $E_{\text{LN}}$. Specifically, we employ the audio embedding $\mathcal{E}_{\text{ASR}}(X_n)$ from the ASR encoder for distillation.

Finally, we add an instruction onto the N-best hypotheses and send them into the LLM to predict the true transcription (i.e., GER), with the language embedding incorporated for denoising. Specifically, we add a minus sign before the noise embedding $E_{\text{LN}}$ to indicate "denoise". This negated embedding is then sent to teach the LLM to do language-space denoising. Therefore, Eq. (1) should be re-written as:

$$Y = \mathcal{M}_{\text{H2T}}(\mathcal{Y}_N; -E_{\text{LN}}), \tag{3}$$

Here $\mathcal{M}_{\text{H2T}}$ denotes the H2T mapping learned by efficient LLM finetuning, where we follow the adapter tuning from previous works (Zhang et al., 2023b; Yang et al., 2023b). We also borrow their idea of input-level prompting to incorporate our language noise embedding into LLM tuning, and the details are presented in §B.1. Similar to Eq. (2), we follow the original GER benchmark for optimization.

4.2 Language-space Noise Embedding

As directly incorporating an audio-space noise embedding into LLM finetuning could harm its stability and performance (Zhang et al., 2023b; Gao et al., 2023), we propose an alternative that extracts a language-space noise embedding from the N-best hypotheses to represent the noise conditions of the source speech. The key idea is to perceive the audio noise from the diversity of the N-best hypotheses: the worse the noisy conditions (more challenging noise type or lower SNR), the higher the uncertainty of ASR beam search decoding, which results in more diverse N-best hypotheses (see Table 15 and Fig. 6).
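To make the notion of N-best diversity concrete, the following sketch scores a list by its mean pairwise word-level edit distance. This scalar proxy is our own illustration and is not the noise embedding proposed in this work; the function names and the two example lists are hypothetical.

```python
from itertools import combinations

def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two word sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def nbest_diversity(hyps: list[str]) -> float:
    """Mean pairwise word-level edit distance over an N-best list."""
    pairs = list(combinations(hyps, 2))
    return sum(word_edit_distance(x.split(), y.split()) for x, y in pairs) / len(pairs)

# Hypothetical 5-best lists: identical hypotheses (confident decoding)
# versus hypotheses that disagree on single words (uncertain decoding).
clean_like = ["turn on the light"] * 5
noisy_like = ["turn on the light", "turn of the light", "turn on a light",
              "burn on the light", "turn on the night"]

clean_score = nbest_diversity(clean_like)  # 0.0
noisy_score = nbest_diversity(noisy_like)  # 1.6
```

A single scalar like this discards which words differ and how, which is one motivation for using embedding differences instead.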

As illustrated in the right part of Fig. 2, we extract the noise embedding on both utterance and token levels to capture rich diversity information: 1) Utterance-level: examine the diversity inside the N-best list in terms of the entire utterance's semantic meaning, which indicates the effect of audio noise on the global semantics of the hypotheses; 2) Token-level: examine the distribution of the N-best hypotheses in terms of all the tokens inside, which is similar to edit distance and thus directly corresponds to the WER metric. These two embeddings are finally combined to form the resulting noise embedding, i.e., $E_{\text{LN}} = [E_{\text{LN}}^{utt}; E_{\text{LN}}^{tok}]$. Specifically, we employ sentence-BERT (SBERT) (Reimers & Gurevych, 2019) to obtain the embeddings from raw text, which contain rich language-space semantic information.

4.2.1 Utterance-level Noise Embedding

Given N-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$, we first obtain their sentence embeddings by the SBERT encoder $\mathcal{E}_{\text{sbert}}$ and then calculate their diversity as:

$$E_{\text{LN}}^{utt} = \text{Concat}\{[\mathcal{E}_{\text{sbert}}(Y_i) - \mathcal{E}_{\text{sbert}}(Y_j)]^{N}_{i,j=1,i>j}\} \in \mathbb{R}^{\frac{N\cdot(N-1)}{2} \times D_{\text{sbert}}}, \tag{4}$$

where $D_{\text{sbert}}$ denotes the embedding size of the SBERT extractor. In short, it concatenates all the sentence embedding differences $\mathcal{E}_{\text{sbert}}(Y_i) - \mathcal{E}_{\text{sbert}}(Y_j)$ where $i > j$, resulting in an utterance-level noise embedding $E_{\text{LN}}^{utt} \in \mathbb{R}^{N\cdot(N-1)/2 \times D_{\text{sbert}}}$. The key idea is that $Y_i$ ranks lower than $Y_j$ in the N-best hypotheses list (since $i > j$), and thus presents lower confidence and worse transcription quality, i.e., more language noise. Therefore, Eq. (4) serves as a measurement of the audio noise in language space. Noisier speech leads to larger ASR decoding uncertainty and thus more diverse N-best hypotheses, so that Eq. (4) captures a larger diversity embedding.
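The pairwise construction in Eq. (4) can be sketched as follows. To keep the example self-contained, `toy_encode` is a hypothetical stand-in for the SBERT encoder (a deterministic bag-of-words vector of size 8), so only the shapes and the $i > j$ pairing are meaningful here, not the embedding values.

```python
import zlib
import numpy as np

D = 8  # toy embedding size standing in for D_sbert (assumption)

def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for E_sbert: mean of deterministic per-word vectors."""
    vecs = [np.random.default_rng(zlib.crc32(w.encode())).standard_normal(D)
            for w in text.split()]
    return np.mean(vecs, axis=0) if vecs else np.zeros(D)

def utterance_noise_embedding(hyps: list[str]) -> np.ndarray:
    """Stack E(Y_i) - E(Y_j) for all pairs with i > j, as in Eq. (4)."""
    emb = [toy_encode(h) for h in hyps]
    diffs = [emb[i] - emb[j] for i in range(len(hyps)) for j in range(i)]
    return np.stack(diffs)  # shape: (N * (N - 1) / 2, D)

# Hypothetical 5-best list.
hyps = ["turn on the light", "turn of the light", "turn on a light",
        "burn on the light", "turn on the night"]
E_utt = utterance_noise_embedding(hyps)        # 10 pairs for N = 5
E_same = utterance_noise_embedding(hyps[:1] * 5)  # identical hypotheses
```

With identical hypotheses every difference row is zero, matching the intuition that a confident decoder yields a small diversity embedding.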

4.2.2 Token-level Noise Embedding

Apart from the utterance-level embedding, we also propose to extract a token-level noise embedding that directly corresponds to the WER metric of the ASR task. As shown in the bottom-right part of Fig. 2, similar to the calculation of edit distance, we first force-align the N-best hypotheses to the same length with zero padding (i.e., "Ø"). The aligned N-best hypotheses $\mathcal{Y}_N^{ali} = \{Y_1^{ali}, Y_2^{ali}, \cdots, Y_N^{ali}\}$ clearly illustrate the token differences between candidates, where each utterance contains $T$ tokens that come from the ASR vocabulary $\mathcal{V}$ plus the zero padding Ø:

$$Y_i^{ali} = [y_{i_1}^{ali}, y_{i_2}^{ali}, \cdots, y_{i_T}^{ali}], \quad y_{i_t}^{ali} \in \mathcal{V} \cup \text{\O{}}, \tag{5}$$

Inspired by edit distance, we design an "edit embedding" to capture the token-level difference between two hypotheses, which directly corresponds to their gap in final WER performance. Then, similar to Eq. (4), we calculate the token-level noise embedding by concatenating the edit embeddings between different pairs of hypotheses in the N-best list, where each edit embedding sums the token embedding differences over the $T$ aligned positions:

$$E_{\text{LN}}^{tok} = \text{Concat}\{E_{\text{edit}}(Y_i^{ali}, Y_j^{ali})^{N}_{i,j=1,i>j}\} \in \mathbb{R}^{\frac{N(N-1)}{2} \times D_{\text{sbert}}}, \tag{6}$$

$$E_{\text{edit}}(Y_i^{ali}, Y_j^{ali}) = \sum_{t=1}^{T} [\mathcal{E}_{\text{sbert}}(y_{i_t}^{ali}) - \mathcal{E}_{\text{sbert}}(y_{j_t}^{ali})],$$

Note that we employ SBERT again to extract the token embedding, as it can produce informative embeddings for both utterances and tokens (Reimers & Gurevych, 2019).
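A minimal sketch of Eq. (5)-(6) is given below. It assumes the hypotheses are already aligned to a common length, uses a hypothetical `toy_token_encode` in place of SBERT, and assumes that the padding token Ø maps to the zero vector; none of these choices is confirmed by the text above.

```python
import zlib
import numpy as np

D = 8      # toy embedding size standing in for D_sbert (assumption)
PAD = "Ø"  # padding symbol inserted by the alignment step

def toy_token_encode(token: str) -> np.ndarray:
    """Hypothetical stand-in for E_sbert on single tokens; PAD -> zeros (assumption)."""
    if token == PAD:
        return np.zeros(D)
    return np.random.default_rng(zlib.crc32(token.encode())).standard_normal(D)

def edit_embedding(y_i: list[str], y_j: list[str]) -> np.ndarray:
    """Sum over aligned positions t of E(y_i_t) - E(y_j_t), second line of Eq. (6)."""
    assert len(y_i) == len(y_j), "hypotheses must be aligned to the same length"
    return np.sum([toy_token_encode(a) - toy_token_encode(b)
                   for a, b in zip(y_i, y_j)], axis=0)

def token_noise_embedding(aligned: list[list[str]]) -> np.ndarray:
    """Stack the edit embeddings of all pairs with i > j, first line of Eq. (6)."""
    n = len(aligned)
    return np.stack([edit_embedding(aligned[i], aligned[j])
                     for i in range(n) for j in range(i)])

# Hypothetical aligned 3-best list (T = 4); the third hypothesis drops "the".
aligned = [
    ["turn", "on", "the", "light"],
    ["turn", "on", "the", "night"],
    ["turn", "on", PAD,   "light"],
]
E_tok = token_noise_embedding(aligned)  # shape: (3, 8)
```

Positions where two hypotheses agree contribute exactly zero, so each row reflects only the substituted, inserted, or deleted tokens, which is the sense in which the embedding mirrors edit distance.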

4.3 Audio Noise Distillation

Figure 3: Audio noise distillation by mutual information neural estimation (MINE). The trainable tuner is designed to maximize the MI between our extracted noise embedding and the noisy speech.

After extracting the language-space noise embedding from N-best hypotheses, we further propose an audio noise distillation approach via mutual information estimation to enhance its noise representation ability. Mutual information (MI) is a measure of dependence between random variables based on the Shannon entropy, which is equivalent to the Kullback-Leibler (KL-) divergence between the joint distribution and the product of the marginal distribution of random variables. Given two random variables X𝑋Xitalic_X and Z𝑍Zitalic_Z, their MI can be calculated by:

I(X;Z)𝐼𝑋𝑍\displaystyle I(X;Z)italic_I ( italic_X ; italic_Z ) =DKL(XZXZ),absentsubscript𝐷𝐾𝐿conditionalsubscript𝑋𝑍subscript𝑋subscript𝑍\displaystyle=D_{KL}(\mathbb{P}_{XZ}\parallel\mathbb{P}_{X}\mathbb{P}_{Z}),= italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( blackboard_P start_POSTSUBSCRIPT italic_X italic_Z end_POSTSUBSCRIPT ∥ blackboard_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT ) , (7)

where DKL()subscript𝐷𝐾𝐿conditionalD_{KL}(\mathbb{P}\parallel\mathbb{Q})italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( blackboard_P ∥ blackboard_Q ) denotes KL-divergence. However, it is intractable to directly calculate MI based on Eq.(7), so we leverage an estimation method called mutual information neural estimation (MINE) from previous work (Belghazi et al., 2018). MINE employs a statistics network ψ𝜽:𝒳×𝒵:subscript𝜓𝜽𝒳𝒵\psi_{\bm{\theta}}:\mathcal{X}\times\mathcal{Z}\rightarrow\mathbb{R}italic_ψ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT : caligraphic_X × caligraphic_Z → blackboard_R parameterized by θΘ𝜃Θ\theta\in\Thetaitalic_θ ∈ roman_Θ to estimate a neural information measure:

IΘ(X;Z)subscript𝐼Θ𝑋𝑍\displaystyle I_{\Theta}(X;Z)italic_I start_POSTSUBSCRIPT roman_Θ end_POSTSUBSCRIPT ( italic_X ; italic_Z ) =supθΘ𝔼XZ[ψ𝜽]log(𝔼XZ[eψ𝜽]),absentsubscriptsupremum𝜃Θsubscript𝔼subscript𝑋𝑍delimited-[]subscript𝜓𝜽subscript𝔼subscript𝑋subscript𝑍delimited-[]superscript𝑒subscript𝜓𝜽\displaystyle=\sup_{\theta\in\Theta}\mathbb{E}_{\mathbb{P}_{XZ}}[\psi_{\bm{% \theta}}]-\log(\mathbb{E}_{\mathbb{P}_{X}\mathbb{P}_{Z}}[e^{\psi_{\bm{\theta}}% }]),= roman_sup start_POSTSUBSCRIPT italic_θ ∈ roman_Θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_X italic_Z end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_ψ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ] - roman_log ( blackboard_E start_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_X end_POSTSUBSCRIPT blackboard_P start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_e start_POSTSUPERSCRIPT italic_ψ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ] ) , (8)

In practice, we treat pairs of the extracted language-space noise embedding $E_{\text{LN}}$ and the noisy audio embedding $\mathcal{E}_{\text{ASR}}(X_n)$ as samples from the joint distribution, and pairs of $E_{\text{LN}}$ and the clean audio embedding $\mathcal{E}_{\text{ASR}}(X_c)$ as samples from the marginal distribution, since the noise information exists only in noisy speech.
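The batch form of this estimate can be sketched as follows. This is an illustrative helper under stated assumptions, not the authors' implementation: the function name `mine_batch_estimate` is hypothetical, the inputs are assumed to be scalar outputs of the statistics network on the positive (noisy) and negative (clean) pairs, and the log-mean-exp stabilization is an implementation choice added here.

```python
import math

def mine_batch_estimate(joint_scores, marginal_scores):
    """Monte Carlo estimate of the objective in Eq.(8).

    joint_scores:    statistics-network outputs on (E_LN, noisy audio embedding) pairs.
    marginal_scores: statistics-network outputs on (E_LN, clean audio embedding) pairs.
    """
    mean_joint = sum(joint_scores) / len(joint_scores)
    # Numerically stable log(mean(exp(s))): subtract the maximum before exponentiating.
    m = max(marginal_scores)
    log_mean_exp = m + math.log(
        sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    )
    return mean_joint - log_mean_exp

# Made-up scores: the network rates noisy pairs higher than clean pairs.
estimate = mine_batch_estimate([2.0, 1.5, 2.5], [0.0, 0.0, 0.0])

# If both sets of scores coincide, the estimate cannot be positive (Jensen's inequality).
same = [0.3, -1.2, 0.8]
no_information = mine_batch_estimate(same, same)
```

A larger gap between the scores on noisy pairs and the scores on clean pairs yields a larger estimate, which is what the statistics network is trained to produce.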

Algorithm 1 describes how MINE is utilized for audio noise distillation, which includes two stages. First, the statistics network $\psi_{\bm{\theta}}$ is trained to learn accurate MI estimation using both the positive and negative sample pairs introduced above. Second, a learnable tuner $\mathcal{T}_{\bm{\omega}}$ is introduced to modulate the language embedding $E_{\text{LN}}$ so that it captures more real noise information, by maximizing the MI between it and the noisy audio embeddings. More details about the MINE-based audio noise distillation are in §B.2. In addition, the LLM adapter is also updated in the second stage to learn the H2T mapping for GER.

Algorithm 1 Audio noise distillation via mutual information neural estimation (MINE).
1: LLM $\mathcal{M}_{\text{H2T}}$ with adapter $\mathcal{G}_{\bm{\upsilon}}$, MINE statistics network $\psi$ of parameters $\bm{\theta}$, language embedding tuner $\mathcal{T}$ of parameters $\bm{\omega}$. N-best hypotheses $\mathcal{Y}_N$. Parallel noisy speech $\mathcal{X}_n$ and clean speech data $\mathcal{X}_c$. Batch size $B$ and the total number of iterations $M$. Hyper-parameter weight $\lambda$.
2: for $m = 1$ to $M$ do
3:     Draw $B$ N-best hypotheses samples from RobustHP dataset: $\{\mathcal{Y}_N^{(1)}, \mathcal{Y}_N^{(2)}, \cdots, \mathcal{Y}_N^{(B)}\}$;
4:     Draw corresponding noisy and clean speech samples: $\{(X_n^{(1)}, X_c^{(1)}), (X_n^{(2)}, X_c^{(2)}), \cdots, (X_n^{(B)}, X_c^{(B)})\}$;
5:     Extract language-space noise embedding from N-best list using Eq.(4-6): $\{E_{\text{LN}}^{(1)}, E_{\text{LN}}^{(2)}, \cdots, E_{\text{LN}}^{(B)}\}$;
6:     Calculate Eq.(8): $\mathcal{I} = \frac{1}{B}\sum_{b=1}^{B}\psi_{\bm{\theta}}(E_{\text{LN}}^{(b)}, \mathcal{E}_{\text{ASR}}(X_n^{(b)})) - \log\left(\frac{1}{B}\sum_{b=1}^{B} e^{\psi_{\bm{\theta}}(E_{\text{LN}}^{(b)}, \mathcal{E}_{\text{ASR}}(X_c^{(b)}))}\right)$;
7:     Calculate $\bm{g}_{\bm{\theta}} = \nabla_{\bm{\theta}}(\mathcal{I})$ and update $\bm{\theta}$ by gradient ascent: $\bm{\theta} \leftarrow \bm{\theta} + \bm{g}_{\bm{\theta}}$;
8:     Calculate GER cost function $\mathcal{L}_{\text{H2T}}$ using Eq.(2), with $\mathcal{T}_{\bm{\omega}}(E_{\text{LN}}^{(b)})$ incorporated for denoising;
9:     Re-calculate the first term of Eq.(8): $\mathcal{I}_1 = \frac{1}{B}\sum_{b=1}^{B}\psi_{\bm{\theta}}(\mathcal{T}_{\bm{\omega}}(E_{\text{LN}}^{(b)}), \mathcal{E}_{\text{ASR}}(X_n^{(b)}))$;
10:     Calculate $\bm{g}_{\bm{\upsilon},\bm{\omega}} = \nabla_{\bm{\upsilon},\bm{\omega}}(\mathcal{L}_{\text{H2T}} - \lambda\mathcal{I}_1)$ and update $\bm{\upsilon}, \bm{\omega}$ by gradient descent: $\bm{\upsilon} \leftarrow \bm{\upsilon} - \bm{g}_{\bm{\upsilon}}, \bm{\omega} \leftarrow \bm{\omega} - \bm{g}_{\bm{\omega}}$;
11: end for
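The alternating update schedule of Algorithm 1 can be illustrated with a deliberately tiny scalar sketch. Everything below is an assumption made for illustration and does not reflect the authors' architecture: embeddings are single numbers, the statistics network is the bilinear form `theta * e * a`, the tuner is `omega * e`, the GER loss is replaced by a squared-error stand-in with one adapter scalar `upsilon`, and an explicit learning rate `lr` is added although Algorithm 1 omits it. Gradients are written out by hand so the example needs no external library.

```python
import math

def mi_estimate(theta, e_ln, a_noisy, a_clean):
    """Batch estimate of Eq.(8) with the toy statistics network psi(e, a) = theta * e * a."""
    joint = sum(theta * e * a for e, a in zip(e_ln, a_noisy)) / len(e_ln)
    marg = sum(math.exp(theta * e * a) for e, a in zip(e_ln, a_clean)) / len(e_ln)
    return joint - math.log(marg)

def mi_grad_theta(theta, e_ln, a_noisy, a_clean):
    """Analytic derivative of mi_estimate with respect to theta."""
    joint = sum(e * a for e, a in zip(e_ln, a_noisy)) / len(e_ln)
    weights = [math.exp(theta * e * a) for e, a in zip(e_ln, a_clean)]
    marg = sum(w * e * a for w, e, a in zip(weights, e_ln, a_clean)) / sum(weights)
    return joint - marg

def train_step(theta, omega, upsilon, batch, lam=0.1, lr=0.1):
    """One iteration of the loop body (steps 6-10) on the toy scalar model."""
    e_ln, a_noisy, a_clean, targets = batch
    B = len(e_ln)
    # Stage 1 (steps 6-7): gradient ascent on the statistics-network parameter.
    theta = theta + lr * mi_grad_theta(theta, e_ln, a_noisy, a_clean)
    # Stage 2 (step 8): stand-in for L_H2T, a mean squared error of
    # upsilon * T_omega(e) against a target, with T_omega(e) = omega * e.
    resid = [upsilon * omega * e - y for e, y in zip(e_ln, targets)]
    g_upsilon = sum(2 * r * omega * e for r, e in zip(resid, e_ln)) / B
    g_omega_loss = sum(2 * r * upsilon * e for r, e in zip(resid, e_ln)) / B
    # Step 9: I_1 = mean(theta * omega * e * a_noisy); it depends on omega only.
    g_omega_mi = sum(theta * e * a for e, a in zip(e_ln, a_noisy)) / B
    # Step 10: gradient descent on L_H2T - lam * I_1.
    upsilon = upsilon - lr * g_upsilon
    omega = omega - lr * (g_omega_loss - lam * g_omega_mi)
    return theta, omega, upsilon

# Made-up batch: (E_LN, noisy audio embedding, clean audio embedding, toy targets).
batch = ([1.0, 2.0], [1.0, 1.5], [0.5, 0.2], [1.0, 2.0])
theta0, omega0, upsilon0 = 0.0, 1.0, 1.0
theta1, omega1, upsilon1 = train_step(theta0, omega0, upsilon0, batch)
```

The point of the sketch is the ordering: $\bm{\theta}$ is updated first to sharpen the MI estimate, and the updated $\bm{\theta}$ is then used when the tuner and adapter parameters are updated against $\mathcal{L}_{\text{H2T}} - \lambda\mathcal{I}_1$.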

5 Experiments

Table 1: WER (%) results of RobustGER with LLaMA-2-7b finetuning. ``$\text{LM}_{rank}$'' denotes LM rescoring. ``+ Audio Denoising'' denotes introducing audio embedding to denoise GER. $o_{nb}$ and $o_{cp}$ respectively denote the N-best oracle and compositional oracle that are defined in §5.1. The subscript percentage denotes relative WER reduction over ASR baseline, i.e., GER improvement.

| Dataset | Test Set | Baseline | $\text{LM}_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 12.2 | $6.5_{-48.4\%}$ | $6.4_{-49.2\%}$ | $\bm{5.6}_{-55.6\%}$ | 10.5 | 3.0 |
| CHiME-4 | test-simu | 15.4 | 14.5 | $9.2_{-40.3\%}$ | $9.0_{-41.6\%}$ | $\bm{8.2}_{-46.8\%}$ | 12.9 | 5.0 |
| CHiME-4 | dev-real | 10.6 | 10.3 | $5.0_{-52.8\%}$ | $4.9_{-53.8\%}$ | $\bm{4.1}_{-61.3\%}$ | 9.1 | 2.1 |
| CHiME-4 | dev-simu | 12.4 | 11.9 | $6.8_{-45.2\%}$ | $6.6_{-46.8\%}$ | $\bm{5.8}_{-53.2\%}$ | 10.6 | 3.3 |
| CHiME-4 | avg. | 12.8 | 12.2 | $6.9_{-46.1\%}$ | $6.7_{-47.7\%}$ | $\bm{5.9}_{-53.9\%}$ | 10.8 | 3.4 |
| VB-DEMAND | baby-cry | 8.0 | 7.8 | $7.0_{-12.5\%}$ | $6.9_{-13.8\%}$ | $\bm{6.0}_{-25.0\%}$ | 4.5 | 3.0 |
| VB-DEMAND | helicopter | 8.4 | 8.1 | $7.4_{-11.9\%}$ | $7.3_{-13.1\%}$ | $\bm{6.9}_{-17.9\%}$ | 4.8 | 3.2 |
| VB-DEMAND | crowd-party | 22.6 | 22.3 | $21.4_{-5.3\%}$ | $21.0_{-7.1\%}$ | $\bm{19.2}_{-15.0\%}$ | 16.5 | 11.5 |
| VB-DEMAND | avg. | 13.0 | 12.7 | $11.9_{-8.5\%}$ | $11.7_{-10.0\%}$ | $\bm{10.7}_{-17.7\%}$ | 8.6 | 5.9 |
| NOIZEUS | babble | 16.5 | 16.7 | $16.5_{-0.0\%}$ | $16.1_{-2.4\%}$ | $\bm{14.5}_{-12.1\%}$ | 9.5 | 5.8 |
| NOIZEUS | car | 17.4 | 16.8 | $15.3_{-12.1\%}$ | $15.2_{-12.6\%}$ | $\bm{14.9}_{-14.4\%}$ | 9.9 | 7.9 |
| NOIZEUS | station | 12.0 | 11.6 | $10.3_{-14.2\%}$ | $10.3_{-14.2\%}$ | $\bm{9.5}_{-20.8\%}$ | 6.6 | 5.0 |
| NOIZEUS | train | 15.3 | 15.2 | $14.9_{-2.6\%}$ | $15.0_{-2.0\%}$ | $\bm{14.9}_{-2.6\%}$ | 10.3 | 7.9 |
| NOIZEUS | street | 17.4 | 17.2 | $17.4_{-0.0\%}$ | $17.1_{-1.7\%}$ | $\bm{16.1}_{-7.5\%}$ | 12.4 | 9.9 |
| NOIZEUS | airport | 11.2 | 11.0 | $10.7_{-4.5\%}$ | $10.5_{-6.3\%}$ | $\bm{9.5}_{-15.2\%}$ | 7.9 | 4.5 |
| NOIZEUS | exhibition | 13.2 | 13.2 | $12.8_{-3.0\%}$ | $12.4_{-6.1\%}$ | $\bm{9.5}_{-28.0\%}$ | 8.3 | 5.8 |
| NOIZEUS | restaurant | 13.2 | 13.0 | $12.4_{-6.1\%}$ | $12.5_{-5.3\%}$ | $\bm{12.0}_{-9.1\%}$ | 8.7 | 6.2 |
| NOIZEUS | avg. | 14.5 | 14.3 | $13.8_{-4.8\%}$ | $13.6_{-6.2\%}$ | $\bm{12.6}_{-13.1\%}$ | 9.2 | 6.6 |
| LS-FreeSound | metro | 9.9 | 9.8 | $9.5_{-4.0\%}$ | $9.4_{-5.1\%}$ | $\bm{8.9}_{-10.1\%}$ | 7.9 | 4.9 |
| LS-FreeSound | car | 4.0 | 4.0 | $3.7_{-7.5\%}$ | $3.5_{-12.5\%}$ | $\bm{3.1}_{-22.5\%}$ | 3.0 | 1.8 |
| LS-FreeSound | traffic | 8.3 | 8.2 | $8.0_{-3.6\%}$ | $7.8_{-6.0\%}$ | $\bm{7.5}_{-9.6\%}$ | 6.8 | 4.5 |
| LS-FreeSound | cafe | 9.8 | 9.5 | $8.1_{-17.3\%}$ | $8.1_{-17.3\%}$ | $\bm{7.5}_{-23.5\%}$ | 7.1 | 4.6 |
| LS-FreeSound | babble | 32.0 | 31.8 | $31.3_{-2.2\%}$ | $31.6_{-1.3\%}$ | $\bm{31.1}_{-2.8\%}$ | 28.7 | 19.3 |
| LS-FreeSound | ac/vacuum | 12.4 | 12.5 | $12.3_{-0.8\%}$ | $12.1_{-2.4\%}$ | $\bm{11.4}_{-8.1\%}$ | 10.2 | 6.2 |
| LS-FreeSound | avg. | 12.7 | 12.6 | $12.2_{-3.9\%}$ | $12.1_{-4.7\%}$ | $\bm{11.6}_{-8.7\%}$ | 10.6 | 6.9 |
| RATS | test | 45.7 | 45.6 | $45.2_{-1.1\%}$ | $44.8_{-2.0\%}$ | $\bm{43.2}_{-5.5\%}$ | 38.8 | 23.6 |
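The subscript percentages in Table 1 follow from the WER values themselves. As a small worked check, the sketch below recomputes the relative WER reduction over the ASR baseline for a few entries copied from the table. The function name `relative_wer_reduction` is a hypothetical helper introduced here, and rounding to one decimal place is assumed from the way the table is printed.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction over the baseline, in percent."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# (baseline WER, system WER) pairs copied from Table 1.
examples = {
    "CHiME-4 test-real, GER": (12.6, 6.5),
    "CHiME-4 test-real, RobustGER": (12.6, 5.6),
    "CHiME-4 avg., RobustGER": (12.8, 5.9),
}

reductions = {
    name: round(relative_wer_reduction(base, system), 1)
    for name, (base, system) in examples.items()
}
```

For instance, the CHiME-4 average for RobustGER gives $(12.8 - 5.9) / 12.8 \approx 53.9\%$, the figure quoted in the abstract.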
Table 2: WER (%) results of RobustGER on different SNR-level testing conditions. The test sets are from LS-FreeSound dataset, with five SNR levels on two noise types. More results are in Table 11.
Noise Type  SNR (dB)  Baseline  LM_rank  GER            + Audio Denoising  RobustGER (ours)  Oracle o_nb  Oracle o_cp
Metro       0         9.9       9.8      9.5 (-4.0%)    9.4 (-5.1%)        8.9 (-10.1%)      7.9          4.9
            5         7.2       7.0      6.7 (-6.9%)    6.4 (-11.1%)       5.5 (-23.6%)      5.5          3.2
            10        4.8       4.6      4.2 (-12.5%)   4.3 (-10.4%)       4.0 (-16.7%)      3.9          2.3
            15        3.9       3.5      3.2 (-17.9%)   3.2 (-17.9%)       3.0 (-23.1%)      3.1          1.7
            20        3.3       3.1      2.7 (-18.2%)   2.6 (-21.2%)       2.3 (-30.3%)      2.6          1.3
            avg.      5.8       5.6      5.3 (-8.6%)    5.2 (-10.3%)       4.7 (-19.0%)      4.6          2.7
AC/Vacuum   0         12.4      12.5     12.3 (-0.8%)   12.1 (-2.4%)       11.4 (-8.1%)      10.2         6.2
            5         7.4       7.0      6.5 (-12.2%)   6.3 (-14.9%)       5.8 (-21.6%)      5.5          3.1
            10        6.6       6.2      5.5 (-16.7%)   5.6 (-15.2%)       5.5 (-16.7%)      4.5          2.6
            15        4.4       4.2      3.7 (-15.9%)   3.7 (-15.9%)       3.6 (-18.2%)      3.3          1.8
            20        3.8       3.7      3.3 (-13.2%)   3.2 (-15.8%)       2.9 (-23.7%)      2.8          1.4
            avg.      6.9       6.7      6.3 (-8.7%)    6.2 (-10.1%)       5.8 (-15.9%)      5.3          3.0
Clean       ∞         3.0       2.8      2.5 (-16.7%)   2.4 (-20.0%)       2.1 (-30.0%)      2.5          1.4
Table 3: Ablation study of the language-space noise embedding in terms of utterance and token levels. More studies are presented in Table 13 and Table 14.
Test Set                Baseline  GER           + Audio Denoising  + Language Denoising
                                                                   Utt.-level    Tok.-level    Both
CHiME-4    test-real    12.6      6.5 (-48.4%)  6.4 (-49.2%)       6.4 (-49.2%)  6.1 (-51.6%)  5.9 (-53.2%)
           test-simu    15.4      9.2 (-40.3%)  9.0 (-41.6%)       9.1 (-40.9%)  8.9 (-42.2%)  8.6 (-44.2%)
           dev-real     10.6      5.0 (-52.8%)  4.9 (-53.8%)       4.7 (-55.7%)  4.4 (-58.5%)  4.4 (-58.5%)
           dev-simu     12.4      6.8 (-45.2%)  6.6 (-46.8%)       6.4 (-48.4%)  6.3 (-49.2%)  6.1 (-50.8%)
           avg.         12.8      6.9 (-46.1%)  6.7 (-47.7%)       6.7 (-47.7%)  6.4 (-50.0%)  6.3 (-50.8%)
VB-DEMAND  baby-cry     8.0       7.0 (-12.5%)  6.9 (-13.8%)       6.7 (-16.3%)  6.6 (-17.5%)  6.4 (-20.0%)
           helicopter   8.4       7.4 (-11.9%)  7.3 (-13.1%)       7.3 (-13.1%)  7.1 (-15.5%)  7.1 (-15.5%)
           crowd-party  22.6      21.4 (-5.3%)  21.0 (-7.1%)       20.8 (-8.0%)  20.3 (-10.2%) 19.9 (-11.9%)
           avg.         13.0      11.9 (-8.5%)  11.7 (-10.0%)      11.6 (-10.8%) 11.3 (-13.1%) 11.1 (-14.6%)
Figure 4: t-SNE visualizations of (a) language-space noise embedding, (b) language embedding with audio distillation, (c) audio noise embeddings. Cluster distances are in Table 17. Details are in §8.

5.1 Setup

We conduct experiments on the proposed RobustHP dataset, which is detailed in §A. To verify the general effectiveness of our approach, we evaluate several recent LLMs, including LLaMA-2-7b/13b (Touvron et al., 2023b), LLaMA-7b (Touvron et al., 2023a), and Falcon-7b (Penedo et al., 2023). We follow the LLM-Adapter of previous work (Zhang et al., 2023b) for both LLM finetuning and noise embedding incorporation. Details of the model and experimental setups are in §C.

We report experimental results in terms of word error rate (WER) and relative GER improvement, i.e., the relative WER reduction over the baseline. We also report two oracle WERs for reference: 1) the N-best oracle $o_{nb}$: the WER of the ``best candidate'' in the N-best list, and 2) the compositional oracle $o_{cp}$: the best achievable WER using all the tokens in the N-best hypotheses. They indicate the upper bounds of reranking and of GER (using only tokens that occur in the hypotheses), respectively.
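As a minimal sketch of these metrics, the following Python snippet computes WER with a word-level edit distance and takes the N-best oracle as the lowest WER in the list. The function names are illustrative, and the reference sentence is an assumption built from the case study in §5.4 (the hypothesis text ending in ``write offs''). The compositional oracle is omitted because its search procedure is not specified in this section.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent: edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def nbest_oracle(reference: str, nbest: List[str]) -> float:
    """N-best oracle: WER of the best candidate in the list."""
    return min(wer(reference, h) for h in nbest)


# Assumed reference, built from the case study's ground truth "write offs".
reference = "the four other utility company owners will also have to take write offs"
nbest = [
    "the four other utility company owners will also have to take write ups",
    "the four other utility company owners will also have to take ride outs",
]
print([round(wer(reference, h), 1) for h in nbest])  # [7.7, 15.4]
print(round(nbest_oracle(reference, nbest), 1))      # 7.7
```

With a 13-word reference, one substitution gives 1/13 ≈ 7.7% and two give 2/13 ≈ 15.4%, which agrees with the per-hypothesis WERs listed in Table 5.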

5.2 Performance of RobustGER

Table 1 presents the experimental results on LLaMA-2-7b; more LLMs are evaluated in §D.1. First, typical LM rescoring brings only minor gains over the Whisper ASR baseline. Compared to LM rescoring, GER makes promising progress by leveraging LLMs to generate the transcription, but its gains remain limited in most noisy conditions other than CHiME-4. Introducing audio denoising further improves the results but suffers from the cross-modality gap. In comparison, with the proposed language-space denoising approach, RobustGER achieves significant gains in various noise conditions, with up to 53.9% GER improvement in terms of WER, and some results even surpass the reranking upper bound.

Table 2 reports the performance of RobustGER under different SNRs, where we observe consistent WER improvements across noise levels. In addition, RobustGER is also effective on clean test data, with a 30.0% relative WER reduction, which indicates that it generalizes beyond noisy conditions.
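The relative improvements quoted here can be checked with a few lines of arithmetic. The sketch below assumes that each percentage is measured against the Baseline column, which is consistent with the numbers in Table 2; the helper name is illustrative. It also flags the rows where RobustGER falls below the N-best oracle, i.e., the reranking upper bound.

```python
def relative_wer_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction over the baseline, in percent."""
    return 100.0 * (baseline - system) / baseline


# (condition, baseline WER, RobustGER WER, N-best oracle WER), copied from Table 2.
rows = [
    ("Metro 0 dB", 9.9, 8.9, 7.9),
    ("Metro 15 dB", 3.9, 3.0, 3.1),
    ("Metro 20 dB", 3.3, 2.3, 2.6),
    ("Clean", 3.0, 2.1, 2.5),
]

for name, base, robust, o_nb in rows:
    gain = relative_wer_reduction(base, robust)
    beats_rerank = robust < o_nb  # lower WER than the best N-best candidate
    print(f"{name}: -{gain:.1f}% relative, beats reranking bound: {beats_rerank}")
```

For the clean set this gives (3.0 - 2.1) / 3.0 = 30.0%, and the Metro 15 dB, Metro 20 dB, and clean rows are the ones in this selection where RobustGER is below the N-best oracle.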

5.3 Ablation Study

Table 3 presents the ablation study on the extraction of the language-space noise embedding, which includes both utterance- and token-level information as introduced in §4.2. We observe that the utterance-level embedding alone yields only minor improvements over vanilla GER, indicating that the global semantic diversity of the N-best hypotheses is not fine-grained enough for error correction. On the other hand, token-level information plays a significant role in language-space denoising for GER, as it directly corresponds to the word error rate metric. Combining both performs best by leveraging richer information to measure N-best list diversity.
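To make the two granularities concrete, the toy sketch below measures N-best diversity per hypothesis pair at the utterance level and at the token level. It is not the extractor of §4.2: the sentence embeddings are replaced by bag-of-words count vectors as a stand-in, the token-level signal is reduced to a word-level edit count, and all function names are hypothetical. The hypotheses are shortened versions of those in Table 5.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def bow_vector(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Bag-of-words counts; a stand-in for a sentence embedding such as SBERT."""
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[vocab[tok]] += 1
    return vec


def token_edits(a: List[str], b: List[str]) -> int:
    """Word-level edit distance between two hypotheses."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def nbest_diversity(nbest: List[str]) -> List[Tuple[int, int, float, int]]:
    """For each hypothesis pair: (i, j, utterance-level L2 gap, token edits)."""
    hyps = [h.split() for h in nbest]
    vocab = {w: k for k, w in enumerate(sorted({w for h in hyps for w in h}))}
    vecs = [bow_vector(h, vocab) for h in hyps]
    out = []
    for i, j in combinations(range(len(hyps)), 2):
        utt_gap = sum((p - q) ** 2 for p, q in zip(vecs[i], vecs[j])) ** 0.5
        out.append((i, j, utt_gap, token_edits(hyps[i], hyps[j])))
    return out


nbest = [
    "will also have to take write ups",
    "will also have to take write ups",
    "will also have to take ride outs",
]
for i, j, utt_gap, tok_edits in nbest_diversity(nbest):
    print(i, j, round(utt_gap, 2), tok_edits)
```

Identical hypotheses give zero diversity at both levels, while the pair that disagrees on the last two words differs by two token edits. The token-level count is the same unit that WER is built from, which is one way to read the observation that token-level information corresponds directly to the metric.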

In addition, we conduct ablation studies on the language embedding extractor (i.e., SBERT vs. FastText (Grave et al., 2018) and LLaMA embedding) in §D.3, as well as on the audio noise distillation techniques (i.e., MINE vs. contrastive learning and teacher-student learning) in §D.4. All of them verify the effectiveness of our specific designs in the RobustGER system.

5.4 Analysis

Visualizations of Noise Embeddings. Fig. 4 visualizes the language-space noise embedding to show how well it represents audio noise. First, Fig. 4(a) shows that the language embedding extracted from the N-best list represents some noise types well (i.e., ``ac'', ``babble'', ``cafe''), while the others are intertwined with the clean embeddings, indicating less optimal noise representations. For reference, the audio noise embeddings in Fig. 4(c) distinguish well between the different conditions. Therefore, we design a KD approach to distill the real noise information in the audio embedding into our language embedding. Fig. 4(b) shows that it disentangles the embeddings of different noise conditions and improves their noise representativeness, which leads to better WER results, as shown in Table 14.

Data Efficiency. As shown in Table 4, we further examine the data efficiency of RobustGER on the CHiME-4 dataset, whose training set contains 9.6k HT pairs decoded from 17.5 hours of speech data. As we gradually reduce the training data, we find that using around half of the data (i.e., 5k pairs) largely maintains the WER performance, i.e., 6.3% vs. 5.9%. When the training set decreases to 2k pairs, RobustGER is still comparable to GER trained on the full set, i.e., 7.2% vs. 6.9%. This experimental evidence supports the data efficiency of RobustGER, which may originate from the parameter-efficient nature of the LLM finetuning.
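The comparison above can be restated as a small lookup over the averaged results. The sketch below copies the average WERs from Table 4 and finds the smallest training subset at which RobustGER is at least as good as GER trained on the full set; the variable and function names are illustrative.

```python
# Average CHiME-4 WER (%) from Table 4, keyed by number of HT training pairs.
robustger_avg_wer = {1000: 9.2, 2000: 7.2, 5000: 6.3, 8000: 6.0, 9600: 5.9}
ger_full_avg_wer = 6.9  # GER trained on all 9.6k pairs
full_size = 9600


def smallest_size_matching(target_wer: float) -> int:
    """Smallest training size whose average WER is at or below target_wer."""
    return min(n for n, w in robustger_avg_wer.items() if w <= target_wer)


n = smallest_size_matching(ger_full_avg_wer)
print(n, f"{100 * n / full_size:.0f}% of the full set")  # 5000 52% of the full set
```

Among the subset sizes reported, 5k pairs (about 52% of the data) is the first at which RobustGER matches or beats full-data GER, while the 2k setting is 0.3 points behind.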

Case Study. Table 5 presents a case study to demonstrate the effectiveness of RobustGER. The N-best hypotheses contain two erroneous variants, i.e., ``write ups'' (in the 1-best) and ``ride outs'', where the ground truth is ``write offs''. Both ChatGPT-based in-context learning and LLaMA-based GER fail to correct this error, because ``write ups'' and ``write offs'' sound quite similar in noisy scenarios. In comparison, RobustGER corrects this error through language-space denoising, where the proposed noise-representative embedding teaches the LLM to remove the language noise in the N-best hypotheses that is caused by audio noise. More importantly, ``write ups'' and ``write offs'' have opposite meanings, which highlights the significance of the successful error correction by RobustGER.

Table 4: Data efficiency of RobustGER on CHiME-4 test sets. The ``1k'', ``2k'', etc., denote the number of HT pairs in the training data, and ``Training Hours'' denotes the duration of the corresponding source speech data.
Test Set        Baseline  GER           RobustGER
                                        1k            2k            5k            8k            9.6k (full)
Training Hours  -         17.5          1.7           3.5           9.2           14.5          17.5
test-real       12.6      6.5 (-48.4%)  9.3 (-26.2%)  7.0 (-44.4%)  5.9 (-53.2%)  5.7 (-54.8%)  5.6 (-55.6%)
test-simu       15.4      9.2 (-40.3%)  11.4 (-26.0%) 9.5 (-38.3%)  8.8 (-42.9%)  8.4 (-45.5%)  8.2 (-46.8%)
dev-real        10.6      5.0 (-52.8%)  7.2 (-32.1%)  5.2 (-50.9%)  4.4 (-58.5%)  4.1 (-61.3%)  4.1 (-61.3%)
dev-simu        12.4      6.8 (-45.2%)  8.9 (-28.2%)  7.1 (-42.7%)  6.2 (-50.0%)  5.9 (-52.4%)  5.8 (-53.2%)
avg.            12.8      6.9 (-46.1%)  9.2 (-28.1%)  7.2 (-43.8%)  6.3 (-50.8%)  6.0 (-53.1%)  5.9 (-53.9%)
Table 5: Case study of RobustGER. We also implement an in-context learning baseline by ChatGPT for comparison (details are in §8). The test sample is selected from the CHiME-4 dev-real set.

| Method | Utterance | WER (%) |
|---|---|---|
| N-best List | the four other utility company owners will also have to take write ups | 7.7 |
| | the four other utility company owners will also have to take write ups | 7.7 |
| | the four other utility company owners will also have to take write ups | 7.7 |
| | the four other utility company owners will also have to take ride outs | 15.4 |
| | the four other utility company owners will also have to take ride outs | 15.4 |
| In-context Learning | the four other utility company owners will also have to take write-ups | 15.4 |
| GER | the four other utility company owners will also have to take write ups | 7.7 |
| RobustGER | the four other utility company owners will also have to take write offs | $\bm{0.0}$ |
| Ground Truth | the four other utility company owners will also have to take write offs | - |
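
The WER values in Table 5 are consistent with the standard definition: word-level edit distance divided by the number of reference words. The sketch below is a minimal illustration, assuming plain whitespace tokenization and no text normalization (an assumption, not necessarily the scoring script behind the reported results); the function name `wer` is ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# Utterances taken from Table 5 (13 reference words).
prefix = "the four other utility company owners will also have to take "
truth = prefix + "write offs"
for tail in ["write ups", "ride outs", "write-ups", "write offs"]:
    print(f"{tail!r}: {100 * wer(truth, prefix + tail):.1f}%")
```

Under this tokenization, `write-ups` is a single token, so it costs one substitution plus one deletion against `write offs`. That is consistent with the in-context learning output scoring 15.4% even though it looks close to the top hypothesis.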

6 Conclusion

In this paper, we first extend the latest ASR generative error correction benchmark to the most common noisy scenarios in the real world, with a proposed RobustHP dataset containing 113K hypotheses-transcription pairs decoded from various noisy ASR corpora. Based on that, we propose RobustGER, a noise-aware generative error correction approach that uses LLMs to predict the ground-truth transcription from N-best hypotheses, where an extracted language-space noise embedding with audio distillation is leveraged to teach LLMs to perform denoising in language space. Extensive experiments on various recent LLMs show that our approach achieves up to 53.9% error correction improvement in terms of WER on the RobustHP dataset, even with limited training data. Further analysis verifies that our proposed language-space embedding effectively represents audio noise, under which off-the-shelf LLMs show a strong ability for language-space denoising.

References

  • Arisoy et al. (2015) Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen. Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5421–5425. IEEE, 2015.
  • Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International conference on machine learning, pp. 531–540. PMLR, 2018.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Chen et al. (2023a) Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, and Eng Siong Chng. Generative error correction for code-switching speech recognition using large language models. arXiv preprint arXiv:2310.13013, 2023a.
  • Chen et al. (2023b) Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, and Ensiong Chng. Hyporadise: An open baseline for generative speech recognition with large language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b.
  • Chen et al. (2023c) Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023c.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  • Fathullah et al. (2023) Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, et al. Prompting large language models with speech recognition abilities. arXiv preprint arXiv:2307.11795, 2023.
  • Feldman et al. (2023) Philip Feldman, James R Foulds, and Shimei Pan. Trapping llm hallucinations using tagged context prompts. arXiv preprint arXiv:2306.06085, 2023.
  • Font et al. (2013) Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp.  411–412, 2013.
  • Fu et al. (2019) Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031–2041. PMLR, 2019.
  • Gao et al. (2023) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  • Gong et al. (2023a) Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-at: Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, 2023a.
  • Gong et al. (2023b) Yuan Gong, Alexander Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint audio and speech understanding. In IEEE Proc. ASRU, 2023b.
  • Graff et al. (2014) David Graff, Kevin Walker, Stephanie M Strassel, Xiaoyi Ma, Karen Jones, and Ann Sawyer. The rats collection: Supporting hlt research with degraded audio data. In LREC, pp.  1970–1977. Citeseer, 2014.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • Guo et al. (2019) Jinxi Guo, Tara N Sainath, and Ron J Weiss. A spelling correction model for end-to-end speech recognition. In Proc. ICASSP, pp. 5651–5655. IEEE, 2019.
  • Hirsch & Pearce (2000) Hans-Günter Hirsch and David Pearce. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000-Automatic speech recognition: challenges for the new Millenium ISCA tutorial and research workshop (ITRW), 2000.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Hu et al. (2020) Ke Hu, Tara N Sainath, Ruoming Pang, and Rohit Prabhavalkar. Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7799–7803. IEEE, 2020.
  • Hu et al. (2022) Ke Hu, Tara N Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, and Weiran Wang. Improving deliberation by text-only and semi-supervised training. arXiv preprint arXiv:2206.14716, 2022.
  • Hu et al. (2023) Ke Hu, Bo Li, and Tara N Sainath. Scaling up deliberation for multilingual asr. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 771–776. IEEE, 2023.
  • Hu & Loizou (2006) Yi Hu and Philipos C Loizou. Subjective comparison of speech enhancement algorithms. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pp.  I–I. IEEE, 2006.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishna et al. (2019) Gautam Krishna, Co Tran, Jianguo Yu, and Ahmed H Tewfik. Speech recognition with no speech or with noisy speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1090–1094. IEEE, 2019.
  • Leng et al. (2021) Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Tao Qin, Xiang-Yang Li, Edward Lin, et al. Fastcorrect 2: Fast error correction on multiple candidates for automatic speech recognition. arXiv preprint arXiv:2109.14420, 2021.
  • Li et al. (2014) Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, 2014.
  • Li et al. (2015) Jinyu Li, Li Deng, Reinhold Haeb-Umbach, and Yifan Gong. Robust automatic speech recognition: a bridge to practical applications, chapter 1, pp. 1–20. Academic Press, 2015.
  • Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
  • Li et al. (2022) Yanxi Li, Xinghao Chen, Minjing Dong, Yehui Tang, Yunhe Wang, and Chang Xu. Spatial-channel token distillation for vision mlps. In International Conference on Machine Learning, pp. 12685–12695. PMLR, 2022.
  • Li et al. (2023b) Yuang Li, Yu Wu, Jinyu Li, and Shujie Liu. Prompting large language models for zero-shot domain adaptation in speech recognition. arXiv preprint arXiv:2306.16007, 2023b.
  • Lin et al. (2021) Hsin-Yi Lin, Huan-Hsin Tseng, Xugang Lu, and Yu Tsao. Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport. Advances in Neural Information Processing Systems, 34:19935–19946, 2021.
  • Loshchilov & Hutter (2018) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • Ma et al. (2023) Rao Ma, Mark JF Gales, Kate Knill, and Mengjie Qian. N-best t5: Robust asr error correction using multiple input hypotheses and constrained decoding space. arXiv preprint arXiv:2303.00456, 2023.
  • Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pp.  1045–1048. Makuhari, 2010.
  • OpenAI (2022) OpenAI. Introducing chatgpt. OpenAI Blog, 2022.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.  5206–5210. IEEE, 2015.
  • Pandey et al. (2021) Ashutosh Pandey, Chunxi Liu, Yun Wang, and Yatharth Saraf. Dual application of speech enhancement for automatic speech recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223–228. IEEE, 2021.
  • Park et al. (2023) Tae Jin Park, Kunal Dhawan, Nithin Koluguri, and Jagadeesh Balam. Enhancing speaker diarization with large language models: A contextual beam search approach. arXiv preprint arXiv:2309.05248, 2023.
  • Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  • Prasad et al. (2021) Archiki Prasad, Preethi Jyothi, and Rajbabu Velmurugan. An investigation of end-to-end models for robust speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  6893–6897. IEEE, 2021.
  • Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.
  • Radhakrishnan et al. (2023) Srijith Radhakrishnan, Chao-Han Yang, Sumeer Khan, Rohit Kumar, Narsis Kiani, David Gomez-Cabrero, and Jesper Tegnér. Whispering llama: A cross-modal generative error correction framework for speech recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  10007–10016, 2023.
  • Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
  • Shin et al. (2019) Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. Effective sentence scoring method using bert for speech recognition. In Asian Conference on Machine Learning, pp.  1081–1093. PMLR, 2019.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Valentini-Botinhao et al. (2016) Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp.  146–152, 2016.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Veaux et al. (2013) Christophe Veaux, Junichi Yamagishi, and Simon King. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 O-COCOSDA/CASLRE, pp.  1–4, 2013.
  • Vincent et al. (2016) Emmanuel Vincent, Shinji Watanabe, Jon Barker, and Ricard Marxer. The 4th chime speech separation and recognition challenge. URL: http://spandh.dcs.shef.ac.uk/chime_challenge (last accessed on 1 August, 2018), 2016.
  • Wang et al. (2024) Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, and Hank Liao. Diarizationlm: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506, 2024.
  • Wang et al. (2023) Siyin Wang, Chao-Han Huck Yang, Ji Wu, and Chao Zhang. Can whisper perform speech-based in-context learning. arXiv preprint arXiv:2309.07081, 2023.
  • Watanabe et al. (2018) Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
  • Wu et al. (2023a) Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. On decoder-only architecture for speech-to-text and large language model integration. arXiv preprint arXiv:2307.03917, 2023a.
  • Wu et al. (2023b) Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe. Improving audio captioning models with fine-grained audio features, text embedding supervision, and llm mix-up augmentation. arXiv preprint arXiv:2309.17352, 2023b.
  • Yang et al. (2021) Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, and Ivan Bulyko. Multi-task language modeling for improving speech recognition of rare words. In Proc. IEEE ASRU, pp.  1087–1093. IEEE, 2021.
  • Yang et al. (2023a) Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, and Andreas Stolcke. Generative speech recognition error correction with large language models and task-activating prompting. In Proc. IEEE ASRU, 2023a.
  • Yang et al. (2023b) Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N Sainath, and Trevor Strohman. From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition. In Proc. ICASSP, pp.  1–5. IEEE, 2023b.
  • Yu et al. (2023) Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, et al. Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In IEEE Proc. ASRU, 2023.
  • Zhang et al. (2023a) Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
  • Zhang et al. (2023b) Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
  • Zhao et al. (2021) Long Zhao, Yuxiao Wang, Jiaping Zhao, Liangzhe Yuan, Jennifer J Sun, Florian Schroff, Hartwig Adam, Xi Peng, Dimitris Metaxas, and Ting Liu. Learning view-disentangled human pose representation by contrastive cross-view mutual information maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12793–12802, 2021.
  • Zhu et al. (2021) Hao Zhu, Huaibo Huang, Yi Li, Aihua Zheng, and Ran He. Arbitrary talking face generation via attentional audio-visual coherence learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 2362–2368, 2021.

Appendix

Appendix A Robust HyPoradise Dataset Details

Table 6: Robust HyPoradise dataset statistics in terms of number of hypotheses-transcription pairs and average utterance length in various noise domains.

| Source | Category | Training Set | # Pairs | Length | Test Set | # Pairs | Length |
|---|---|---|---|---|---|---|---|
| CHiME-4 | Real-world noise | tr05-real | 9,600 | 17.0 | test-real | 1,320 | 16.4 |
| | | | | | test-simu | 1,320 | 16.4 |
| | | | | | dev-real | 1,640 | 16.8 |
| | | | | | dev-simu | 1,640 | 16.8 |
| VB-DEMAND | Unseen noise | train | 23,075 | 7.5 | baby-cry, helicopter, crowd-party | 824 each | 7.7 |
| NOIZEUS | Real-world noise | train | 23,807 | 7.1 | babble, car, station, train, street, airport, exhibition, restaurant | 30 each | 8.1 |
| LS-FreeSound | Real-world noise | train | 28,539 | 35.0 | metro, car, traffic, cafe, babble, ac/vacuum | 118 each | 17.4 |
| RATS | Radio noise | train | 28,504 | 14.2 | test | 1,000 | 10.2 |
| Total | | train | 113,525 | 16.8 | test | 10,340 | 13.7 |

A.1 ASR system

For ASR beam search decoding, we employ Whisper Large-V2 (Radford et al., 2023), a large-scale pre-trained model developed by OpenAI that has been reported to achieve competitive and, in several cases, state-of-the-art performance, to generate N-best hypotheses. The Whisper model follows the encoder-decoder Transformer (Vaswani et al., 2017) architecture with 1,550 million parameters and is trained on 680K hours of multilingual and multitask supervised data collected from the web. As a result, it shows strong noise robustness across various conditions, though it lacks domain specificity (i.e., it still lags behind models trained specifically on a given dataset).

With this pre-trained ASR model, we employ the beam search algorithm for decoding and generate an N-best hypotheses list for each speech sample, where the beam size is set to 50. After removing repeated utterances, we select the top-5 hypotheses in terms of posterior probability as the N-best list. To develop the RobustHP dataset, we apply this decoding strategy to multiple noisy ASR corpora (see §A.2) and generate data pairs of 5-best hypotheses and ground-truth transcription.
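
The post-processing step described above (deduplicate the beam, then keep the five most probable hypotheses) can be sketched as follows. This is a minimal illustration and not the released pipeline: the function name `select_nbest`, the exact-string duplicate check, and the toy beam with its log-probabilities are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def select_nbest(beam: Iterable[Tuple[str, float]], n: int = 5) -> List[str]:
    """Keep the n most probable unique hypotheses from (text, log_prob) pairs."""
    best = {}
    for text, log_prob in beam:
        key = text.strip()  # assumption: duplicates are exact string matches
        if key not in best or log_prob > best[key]:
            best[key] = log_prob  # keep the best score for each unique text
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:n]]


# Toy beam (hypothetical scores); a real beam would hold 50 candidates.
beam = [
    ("take write ups", -0.2),
    ("take write ups", -0.4),
    ("take ride outs", -0.9),
    ("take right ups", -1.3),
    ("take write offs", -1.1),
    ("take rye doubts", -2.5),
    ("take write up", -1.8),
]
five_best = select_nbest(beam)
print(five_best)
```

Each 5-best list produced this way would then be paired with the ground-truth transcription to form one training or test example.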

A.2 Speech Corpus Selection

For speech corpus selection, our goal is to cover common noisy ASR scenarios in the real world. Consequently, we collect and simulate the following corpora with evident domain characteristics to compose the Robust HyPoradise dataset:

CHiME-4 (Vincent et al., 2016): CHiME-4 is a popular dataset for far-field noisy speech recognition. It includes real and simulated noisy recordings in four noisy environments, i.e., bus, cafe, pedestrian area, and street junction. We use its tr05-real split (9,600 utterances) to generate RobustHP training data, as well as the test-real (1,320 utterances), test-simu (1,320 utterances), dev-real (1,640 utterances) and dev-simu (1,640 utterances) splits to generate the test data.

VoiceBank-DEMAND (Valentini-Botinhao et al., 2016): VoiceBank-DEMAND is a popular dataset for noise-robust speech recognition and speech enhancement. We use its training data for RobustHP generation, which contains 23,075 noisy utterances from 56 speakers in the VoiceBank corpus (Veaux et al., 2013) that are recorded at a sampling rate of 16 kHz and mixed with 10 different noise types (babble, cafeteria, car, kitchen, meeting, metro, restaurant, speech-shaped noise, station, traffic) at SNR levels of 0, 5, 10, and 15 dB. For the test set, to simulate the challenging unseen noise conditions encountered in practice, we mix the VoiceBank clean test data with three new types of noise (Lin et al., 2021), i.e., baby-cry, helicopter, and crowd-party, at an SNR level of 0 dB. The test set contains 824 utterances from 2 speakers.

NOIZEUS (Hu & Loizou, 2006): NOIZEUS is a noisy speech corpus developed to evaluate noise-robust speech recognition and speech enhancement algorithms. It only contains a test set of 30 IEEE sentences (produced by 3 male and 3 female speakers) corrupted by 8 different real-world noises at SNR levels of 0, 5, 10, and 15 dB, where we select 5 dB for the main experiments. The noise was taken from the AURORA-2 database (Hirsch & Pearce, 2000), which includes suburban train, babble, car, exhibition hall, restaurant, street, airport and train-station noise. To match the short length of NOIZEUS test utterances (8.1 tokens on average), we select clean speech from the LibriSpeech train-clean-100 and VoiceBank corpora with no more than 12 tokens in the transcription, and mix it with AURORA-2 noises at SNR levels of 0, 5, 10, 15, and 20 dB to form the training set.

LibriSpeech-FreeSound (Prasad et al., 2021): LibriSpeech-FreeSound is a simulated noisy speech corpus for noise-robust speech recognition, which mixes the clean speech data from LibriSpeech train-clean-100 split (Panayotov et al., 2015) and noise data from FreeSound corpus (Font et al., 2013) at SNRs of 0, 5, 10, 15, 20, and 25 dB to form the training set. For test set, they select 118 clean speech samples from LibriSpeech test-clean split and mix them with FreeSound noise at SNRs of 0, 5, 10, 15, and 20 dB, where we select 0 dB for main experiments. Six noise types in FreeSound are employed, including metro, car, traffic, cafe, babble and ac/vacuum.

RATS (Graff et al., 2014): The Robust Automatic Transcription of Speech (RATS) dataset contains radio-communication speech in the ultra-high-frequency data category, which is extremely noisy and challenging for the ASR task. Its training data contains 43,112 noisy speech utterances, from which we filter out the low-quality samples (i.e., those whose WER by Whisper is larger than 0.9) to form the training set. Its test set contains 7,591 utterances, from which we randomly select 1,000 samples for higher evaluation efficiency.
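
Several of the corpora above (VoiceBank-DEMAND, NOIZEUS, LibriSpeech-FreeSound) are simulated by adding noise to clean speech at a target SNR. The sketch below shows one common way to do this with NumPy, under stated assumptions: signals are mono float arrays at the same sampling rate, power is averaged over the whole utterance, and short noise clips are looped. The function name `mix_at_snr` and the synthetic signals are ours, and the original corpora may have used a different procedure (for example, active-speech-level weighting).

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean`, scaled so that 10*log10(P_clean / P_noise) = snr_db."""
    if len(noise) < len(clean):
        # Loop the noise clip if it is shorter than the speech.
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_clean / (scale**2 * p_noise) = 10**(snr_db / 10) for scale.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise


# Synthetic stand-ins for speech and noise (1 s at 16 kHz).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noise = rng.normal(size=8000)

noisy = {snr: mix_at_snr(clean, noise, snr) for snr in (0, 5, 10, 15)}
```

A lower `snr_db` gives a larger noise scale, so the 0 dB condition used for the VoiceBank-DEMAND and LibriSpeech-FreeSound test sets corresponds to noise with the same average power as the speech.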

A.3 Statistics

After performing beam search decoding on the selected speech corpora introduced above, we collect 113K pairs of N-best hypotheses and ground-truth transcription to form the RobustHP dataset. The statistics are presented in Table 6, which reports the number of hypotheses-transcription pairs and the average utterance length in various domains and splits. We will release the RobustHP dataset to the public upon publication and open the development venue for more data.

Appendix B Method Details

B.1 Denoised LLM Finetuning

B.1.1 Efficient LLM Finetuning: LLaMA-Adapter

Figure 5: LLaMA-Adapter tuning (Zhang et al., 2023b) with language-space denoising (ours).

As presented in Fig. 5, we employ LLaMA-Adapter (Zhang et al., 2023b) for efficient LLM finetuning. Given a pre-trained LLM with an $H$-layer Transformer, it inserts a set of learnable adaptation prompts into the top-$L$ layers that learn high-level semantics. Denote the prompt for the $l$-th Transformer layer as $\mathcal{G}_l \in \mathbb{R}^{U \times D}$, where $U$ denotes the prompt length and $D$ denotes the LLM embedding size.

Assume we have $M$ tokens containing the instruction and the already generated response, i.e., $T_l \in \mathbb{R}^{M \times D}$, where $l$ is the layer index, and we aim to predict the $(M+1)$-th token as part of the response. In order to finetune the entire system, the learnable adaptation prompt is concatenated with $T_l$ as a prefix, i.e., $[\mathcal{G}_l; T_l] \in \mathbb{R}^{(U+M) \times D}$. In this case, the instruction knowledge learned by $\mathcal{G}_l$ can guide $T_l$ to generate the subsequent response under teacher-forcing supervision.

Furthermore, considering that the prompt $\mathcal{G}_l$ is randomly initialized and thus may disturb the LLM tuning at early training stages, a zero-initialized attention mechanism is designed to mitigate such disturbance. Suppose the LLM is going to generate the $(M+1)$-th token based on the prompt $\mathcal{G}_l$ and history tokens $T_l$ at the $l$-th layer, and we denote the current $M$-th token as $T_l^{(M)} \in \mathbb{R}^{1 \times D}$. In the attention mechanism, three projection layers first generate the query, key and value, respectively:

$$Q_l = \mathrm{Linear}_q(T_l^{(M)}), \quad K_l = \mathrm{Linear}_k([\mathcal{G}_l; T_l]), \quad V_l = \mathrm{Linear}_v([\mathcal{G}_l; T_l]), \tag{9}$$

Thereafter, the attention score between query and key can be formulated as $A_l = Q_l \cdot K_l^{\top} / \sqrt{D} \in \mathbb{R}^{1 \times (U+M)}$, which captures the correlation between the current token $T_l^{(M)}$ and all $M$ existing tokens $T_l$ as well as the prompt $\mathcal{G}_l$ to predict the next token. Therefore, $A_l$ can be split into two parts:

$$A_l = [A_l^{\mathcal{G}}; A_l^{T}]^{\top}, \tag{10}$$

where $A_l^{\mathcal{G}} \in \mathbb{R}^{U \times 1}$ denotes the attention scores of the $U$ adaptation prompts and $A_l^{T} \in \mathbb{R}^{M \times 1}$ denotes those of the $M$ history tokens. Since the adaptation prompts are randomly initialized, their attention scores may disturb next-token prediction in early training stages. To this end, a learnable gating factor $g_l$ with zero initialization is introduced to adaptively control the importance of the prompt in attention, by directly multiplying it with the prompt's softmax weights computed from the scores in Eq. (10):

$$A_l^{g} = [g_l \cdot \mathrm{softmax}(A_l^{\mathcal{G}});\ \mathrm{softmax}(A_l^{T})]^{T}, \tag{11}$$

Finally, the attention output of the $l$-th Transformer layer can be calculated with a linear projection:

$$O_l^{(M)} = \mathrm{Linear}_o(A_l^{g} \cdot V_l) \in \mathbb{R}^{1\times D}, \tag{12}$$

It is then used to predict the next token $T_l^{(M+1)}$ as part of the output response. The proposed zero-initialization mechanism achieves an effective trade-off between the pre-trained knowledge of the LLM and the instructional knowledge learned through the adaptation prompt.
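The gating in Eq.(10-12) can be illustrated with a minimal pure-Python sketch for a single query position. The function names, the toy scores and the toy value vectors are illustrative assumptions, and the output projection $\mathrm{Linear}_o$ is omitted; this is not the LLaMA-Adapter implementation itself.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_attention_weights(scores, num_prompts, gate):
    """Eq.(10)-(11): split the scores into prompt and token parts,
    apply softmax to each part separately, and scale the prompt part by the gate."""
    prompt_scores = scores[:num_prompts]
    token_scores = scores[num_prompts:]
    prompt_weights = [gate * w for w in softmax(prompt_scores)]
    token_weights = softmax(token_scores)
    return prompt_weights + token_weights

def attend(weights, values):
    """Weighted sum of value vectors (the A_l^g . V_l term of Eq.(12),
    before the output projection, which is left out of this sketch)."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy example (hypothetical numbers): U = 2 prompt scores followed by M = 3 token scores.
scores = [0.3, -1.2, 0.5, 2.0, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]

weights_at_init = gated_attention_weights(scores, num_prompts=2, gate=0.0)
output_at_init = attend(weights_at_init, values)
weights_open = gated_attention_weights(scores, num_prompts=2, gate=1.0)
```

With the gate at its zero initialization, the prompt weights vanish and the output depends only on the history tokens, which is why the randomly initialized prompts cannot disturb early training in this formulation.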

B.1.2 Denoised Adapter Tuning

Apart from text instructions, LLaMA-Adapter is also capable of generating responses based on inputs from other modalities (Zhang et al., 2023b). However, the cross-modal gap between text and other modalities may affect finetuning stability and performance (Li et al., 2023b). Therefore, we propose in §4.2 to extract a language-space noise embedding that replaces the audio embedding for representing the noise conditions of source speech, i.e., $E_{\text{LN}} = [E_{\text{LN}}^{utt}; E_{\text{LN}}^{tok}] \in \mathbb{R}^{N\cdot(N-1)\times D_{\text{sbert}}}$ according to Eq.(9-12), where $N$ denotes the N-best list size and $D_{\text{sbert}}$ denotes the SBERT embedding size. Then, we incorporate it into LLaMA-Adapter for denoising via element-wise subtraction:

$$\mathcal{G}_l^{\text{dn}} = \mathcal{G}_l - g_l^{\text{dn}} \cdot \mathcal{T}_{\omega}(E_{\text{LN}}) \in \mathbb{R}^{U\times D}, \quad \text{we set}\ U = N\cdot(N-1), \tag{13}$$

where $\mathcal{T}_{\omega} \in \mathbb{R}^{D\times D_{\text{sbert}}}$ denotes the linear projection tuner introduced in §4.3 for audio noise distillation, and the subtraction operation denotes ``denoise''. The factor $g_l^{\text{dn}}$ is a gating factor that controls the denoising process. Therefore, the resulting $\mathcal{G}_l^{\text{dn}}$ is the adaptation prompt with language-space denoising, which replaces $\mathcal{G}_l$ in Eq.(9-12) for adapter tuning.
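A minimal sketch of Eq.(13) in plain Python may help make the shapes concrete. The helper names and the toy matrices are assumptions for illustration; the tuner is modelled as a bias-free linear map with a weight of shape $D \times D_{\text{sbert}}$, as the dimensions above suggest.

```python
def project(embedding_rows, weight):
    """Linear tuner T_omega: map each D_sbert-dim row to D dims.
    `weight` has shape D x D_sbert (list of D rows)."""
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in weight]
            for row in embedding_rows]

def denoise_prompt(prompt, noise_embedding, weight, gate_dn):
    """Eq.(13): G_dn = G - g_dn * T_omega(E_LN), element-wise over a U x D prompt."""
    projected = project(noise_embedding, weight)
    if len(projected) != len(prompt):
        raise ValueError("prompt length U must equal the number of rows in E_LN")
    return [[g - gate_dn * n for g, n in zip(g_row, n_row)]
            for g_row, n_row in zip(prompt, projected)]

# Toy sizes (hypothetical): U = 2, D_sbert = 3, D = 2.
# The setup in this appendix uses U = 20 and D_sbert = 384.
prompt = [[1.0, 2.0], [3.0, 4.0]]
noise = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
tuner = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

denoised = denoise_prompt(prompt, noise, tuner, gate_dn=0.5)
unchanged = denoise_prompt(prompt, noise, tuner, gate_dn=0.0)
```

Setting $U = N\cdot(N-1)$ is what allows the row-by-row subtraction: each prompt row is paired with exactly one projected row of $E_{\text{LN}}$.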

B.2 Audio Noise Distillation

As illustrated in §4.3, the key idea of audio noise distillation is to transfer the real noise information in audio embeddings to our extracted language-space noise embedding, in order to enhance its ability to represent audio noise. The approach we propose is based on mutual information neural estimation (MINE) (Belghazi et al., 2018) and is split into two stages in Algorithm 1. First, we update MINE to learn MI estimation, by maximizing the MI between the language-space noise embedding and noisy audio embeddings and minimizing the MI between the language embedding and clean audio embeddings, since audio noise information exists in noisy speech rather than clean speech. Second, we introduce a learnable tuner to modulate the language-space embedding to include more real noise information by maximizing the MI between it and noisy audio embeddings, which is jointly optimized with LLM finetuning (i.e., the GER cost function $\mathcal{L}_{\text{H2T}}$ as formulated in Eq.(2)).

We leverage MINE for distillation instead of other techniques such as contrastive learning because of its strong distinguishing ability, which has been verified by recent applications (Zhu et al., 2021; Zhao et al., 2021; Li et al., 2022). In contrast, directly employing contrastive learning may not work, as the language embedding could be far away from both the noisy and the clean audio-space embeddings: the distance between positive and negative samples (i.e., within audio space) is then much smaller than the distance between them and the anchor (i.e., between audio and language spaces). Our ablation study in Table 14 also verifies this limitation.

Appendix C Experimental Setup Details

C.1 Model Setups

Table 7: Comparison between main configurations of different popular LLMs.
| LLM | LLaMA-2-7b | LLaMA-7b | Falcon-7b | LLaMA-2-13b |
|---|---|---|---|---|
| Number of Transformer Layers $H$ | 32 | 32 | 32 | 40 |
| Number of Attention Heads $N_{\text{head}}$ | 32 | 32 | 71 | 40 |
| Embedding Size $D$ | 4,096 | 4,096 | 4,544 | 5,120 |
| Block Size $B$ | 4,096 | 2,048 | 2,048 | 4,096 |
| Vocabulary Size $V$ | 32,000 | 32,000 | 65,024 | 32,000 |

LLMs. We select three recent and popular LLMs for evaluation: LLaMA-2-7b (Touvron et al., 2023b; https://huggingface.co/meta-llama/Llama-2-7b-hf), LLaMA-7b (Touvron et al., 2023a; https://huggingface.co/yahma/llama-7b-hf), and Falcon-7b (Penedo et al., 2023; https://huggingface.co/tiiuae/falcon-7b). In addition, to explore the influence of LLM size on our approach, we also report some results on the LLaMA-2-13b model (Touvron et al., 2023b; https://huggingface.co/meta-llama/Llama-2-13b-hf). Table 7 compares their main configurations.

Adapter. We follow the default setting of LLaMA-Adapter (Zhang et al., 2023b; https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/adapter.py, https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/adapter.py) with some modifications. The number of tunable Transformer layers $L$ is set to $H-1$, which means all layers except the first one are tunable with inserted prompts. The prompt length $U$ is set to 20 to match the length of $E_{\text{LN}}$, which equals $N\cdot(N-1)$, where $N$ is the N-best list size set to 5. To extract the language-space noise embedding from N-best hypotheses, we utilize sentence-BERT (Reimers & Gurevych, 2019; https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), whose embedding size $D_{\text{sbert}}$ is 384.
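The derived adapter sizes can be checked with a few lines of Python. This is only an arithmetic sketch: the function name and the dictionary are illustrative, while the layer counts are copied from Table 7 and $N = 5$ from the text above.

```python
def adapter_shape(num_layers, nbest_size):
    """Return (tunable layers L, prompt length U) for the setup described above."""
    tunable_layers = num_layers - 1                 # L = H - 1: all layers except the first
    prompt_length = nbest_size * (nbest_size - 1)   # U = N * (N - 1), matching E_LN
    return tunable_layers, prompt_length

# Number of Transformer layers H per model, taken from Table 7.
LAYERS = {"LLaMA-2-7b": 32, "LLaMA-7b": 32, "Falcon-7b": 32, "LLaMA-2-13b": 40}

shapes = {name: adapter_shape(h, nbest_size=5) for name, h in LAYERS.items()}
for name, (tunable, prompt_len) in shapes.items():
    print(f"{name}: L = {tunable}, U = {prompt_len}")
```

The prompt length depends only on $N$, so it stays at 20 for every model, whereas the number of tunable layers grows with model depth.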

MINE. MINE introduces a statistic network $\psi_{\bm{\theta}}$ that contains a multi-layer perceptron (MLP) and a Sigmoid activation function to estimate a mutual information value between 0 and 1. It receives two inputs, the Whisper-encoded audio embeddings of size 1280 and the language-space noise embedding of size 384, which are first projected to the same hidden dimension and added together, and then passed through the MLP to generate an output of size 1. In particular, to incorporate the modulated noise embedding (which has the same size as the LLM embedding, unlike the input language embedding of size 384) into MINE, we design an extra interface that receives it as intermediate features on the language-space feature branch. The noise embedding tuner contains a linear projection from the SBERT size of 384 to the LLM embedding size, as described in §B.1.2.
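The forward pass described above can be sketched in plain Python as follows. Several details are assumptions not stated in the text: the hidden size of 16, the single hidden MLP layer, the ReLU nonlinearity, the random initialization, and the treatment of each input as a single vector rather than a sequence. The extra interface for the modulated embedding is not modelled.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer: weight is a list of output rows, each of length len(x)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(rng, out_dim, in_dim):
    """Small uniform initialization (an assumption of this sketch)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

class StatisticNetwork:
    """Toy psi_theta: project both inputs to one hidden size, add, MLP, sigmoid."""

    def __init__(self, audio_dim=1280, lang_dim=384, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        self.audio_proj = init_linear(rng, hidden_dim, audio_dim)
        self.lang_proj = init_linear(rng, hidden_dim, lang_dim)
        self.mlp_hidden = init_linear(rng, hidden_dim, hidden_dim)
        self.mlp_out = init_linear(rng, 1, hidden_dim)

    def __call__(self, audio_emb, lang_emb):
        audio_h = linear(audio_emb, *self.audio_proj)
        lang_h = linear(lang_emb, *self.lang_proj)
        fused = [a + b for a, b in zip(audio_h, lang_h)]        # add the two branches
        hidden = [max(0.0, h) for h in linear(fused, *self.mlp_hidden)]  # ReLU (assumed)
        logit = linear(hidden, *self.mlp_out)[0]                # output of size 1
        return 1.0 / (1.0 + math.exp(-logit))                   # sigmoid keeps it in (0, 1)

net = StatisticNetwork()
rng = random.Random(1)
audio = [rng.gauss(0.0, 1.0) for _ in range(1280)]   # stand-in for a Whisper embedding
lang = [rng.gauss(0.0, 1.0) for _ in range(384)]     # stand-in for a noise embedding row
score = net(audio, lang)
```

Projecting both inputs to a shared hidden size before adding them is what lets embeddings of different widths (1280 and 384) be fused without concatenation.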

C.2 Training and Evaluation Setups

LLM Finetuning. The learning rate is set to $10^{-2}$ for CHiME-4, which is relatively small, and to $5\times 10^{-3}$ for the relatively large datasets, including VB-DEMAND, NOIZEUS, LS-FreeSound and RATS. The batch size is set to 4, with accumulation iterations set to 8 (i.e., the effective batch size is 32). We train for 2 epochs with the AdamW optimizer (Loshchilov & Hutter, 2018), with weight decay set to 0.02 and warmup steps set to 20% of one epoch's steps. In addition, MINE is updated using a separate AdamW optimizer whose learning rate is 10% of that used for LLM tuning, with all other configurations kept the same. The hyper-parameter $\lambda$ in Algorithm 1 is set to 0.5. We use one NVIDIA A40 GPU for model training, which takes 1.5 hours for CHiME-4, 2.0 hours for VB-DEMAND, 1.6 hours for NOIZEUS, 4.5 hours for LS-FreeSound, and 3.8 hours for RATS.
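The hyper-parameters above can be collected in a small configuration object, sketched below. The class and field names are illustrative assumptions, as is the interpretation of "steps" as optimizer steps after accumulation; the training-set size passed to `warmup_steps` is a hypothetical number, not a dataset statistic from the paper.

```python
import math
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    llm_lr: float
    micro_batch_size: int = 4
    accumulation_iters: int = 8
    epochs: int = 2
    weight_decay: float = 0.02
    warmup_fraction: float = 0.2   # share of one epoch's steps used for warmup
    mine_lr_ratio: float = 0.1     # MINE optimizer uses 10% of the LLM learning rate
    mine_lambda: float = 0.5       # lambda in Algorithm 1

    @property
    def effective_batch_size(self):
        return self.micro_batch_size * self.accumulation_iters

    @property
    def mine_lr(self):
        return self.llm_lr * self.mine_lr_ratio

    def warmup_steps(self, num_train_samples):
        """Warmup length, assuming one step per effective batch."""
        steps_per_epoch = math.ceil(num_train_samples / self.effective_batch_size)
        return round(self.warmup_fraction * steps_per_epoch)

chime4 = FinetuneConfig(llm_lr=1e-2)   # smaller dataset, larger learning rate
others = FinetuneConfig(llm_lr=5e-3)   # VB-DEMAND, NOIZEUS, LS-FreeSound, RATS

print(chime4.effective_batch_size, chime4.mine_lr, chime4.warmup_steps(3200))
```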

Instruction-following Finetuning. As presented in Fig. 1, we leverage an instruction-following finetuning strategy for GER, for which we design the following instruction template:

``Below is the best-hypotheses transcribed from speech recognition system. Please try to revise it using the words which are only included into other-hypothesis, and write the response for the true transcription.### Best-hypothesis:{1-best hypothesis}### Other-hypothesis:{2$\sim$N-best hypotheses}### Response:''

We find that different instruction templates have a slight impact on the final GER performance, which remains an open question for further discussion. In particular, we design some constraints (e.g., only use the words inside the N-best hypotheses list for error correction) to control the quality of the response and avoid potential LLM hallucinations (Feldman et al., 2023).
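As an illustration, the template can be filled from an N-best list with a small helper. The function name, the blank-line separators between sections, the one-hypothesis-per-line layout and the example hypotheses are all assumptions; only the instruction wording and the three `###` section headers come from the template above.

```python
INSTRUCTION = (
    "Below is the best-hypotheses transcribed from speech recognition system. "
    "Please try to revise it using the words which are only included into "
    "other-hypothesis, and write the response for the true transcription."
)

def build_prompt(nbest):
    """Fill the instruction template from an N-best list ordered best-first."""
    if len(nbest) < 2:
        raise ValueError("need a 1-best hypothesis and at least one other hypothesis")
    best, others = nbest[0], nbest[1:]
    sections = [
        INSTRUCTION,
        "### Best-hypothesis:\n" + best,
        "### Other-hypothesis:\n" + "\n".join(others),
        "### Response:\n",
    ]
    return "\n\n".join(sections)

# Hypothetical 3-best list, for illustration only.
prompt = build_prompt([
    "turn on the light",
    "turn on the lights",
    "turn in the light",
])
print(prompt)
```

The LLM is then expected to continue the text after `### Response:` with the corrected transcription.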

Table 8: WER (%) results of RobustGER with LLaMA-7b finetuning. ``$\text{LM}_{rank}$'' denotes LM rescoring. ``+ Audio Denoising'' denotes introducing audio embedding to denoise GER. $o_{nb}$ and $o_{cp}$ respectively denote the N-best oracle and compositional oracle that are defined in §5.1. The subscript percentage denotes relative WER reduction over ASR baseline, i.e., GER improvement.
| Test Set | | Baseline | $\text{LM}_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 12.2 | $6.8_{-46.0\%}$ | $6.6_{-47.6\%}$ | $\mathbf{5.7_{-54.8\%}}$ | 10.5 | 3.0 |
| | test-simu | 15.4 | 14.5 | $10.1_{-34.4\%}$ | $9.7_{-37.0\%}$ | $\mathbf{8.5_{-44.8\%}}$ | 12.9 | 5.0 |
| | dev-real | 10.6 | 10.3 | $4.9_{-53.8\%}$ | $4.7_{-55.7\%}$ | $\mathbf{4.0_{-62.3\%}}$ | 9.1 | 2.1 |
| | dev-simu | 12.4 | 11.9 | $6.9_{-44.4\%}$ | $6.8_{-45.2\%}$ | $\mathbf{6.3_{-49.2\%}}$ | 10.6 | 3.3 |
| | avg. | 12.8 | 12.2 | $7.2_{-43.8\%}$ | $7.0_{-45.3\%}$ | $\mathbf{6.1_{-52.3\%}}$ | 10.8 | 3.4 |
| VB-DEMAND | baby-cry | 8.0 | 7.8 | $7.1_{-11.3\%}$ | $7.2_{-10.0\%}$ | $\mathbf{6.5_{-18.8\%}}$ | 4.5 | 3.0 |
| | helicopter | 8.4 | 8.1 | $7.3_{-13.1\%}$ | $7.2_{-14.3\%}$ | $\mathbf{6.8_{-19.0\%}}$ | 4.8 | 3.2 |
| | crowd-party | 22.6 | 22.3 | $21.5_{-4.9\%}$ | $21.1_{-6.6\%}$ | $\mathbf{20.1_{-11.1\%}}$ | 16.5 | 11.5 |
| | avg. | 13.0 | 12.7 | $12.0_{-7.7\%}$ | $11.8_{-9.2\%}$ | $\mathbf{11.1_{-14.6\%}}$ | 8.6 | 5.9 |
| NOIZEUS | babble | 16.5 | 16.7 | $15.3_{-7.3\%}$ | $15.0_{-9.1\%}$ | $\mathbf{13.6_{-17.6\%}}$ | 9.5 | 5.8 |
| | car | 17.4 | 16.8 | $14.9_{-14.4\%}$ | $14.8_{-14.9\%}$ | $\mathbf{14.9_{-14.4\%}}$ | 9.9 | 7.9 |
| | station | 12.0 | 11.6 | $10.7_{-10.8\%}$ | $10.7_{-10.8\%}$ | $\mathbf{10.3_{-14.2\%}}$ | 6.6 | 5.0 |
| | train | 15.3 | 15.2 | $14.5_{-5.2\%}$ | $14.2_{-7.2\%}$ | $\mathbf{12.8_{-16.3\%}}$ | 10.3 | 7.9 |
| | street | 17.4 | 17.2 | $16.9_{-2.9\%}$ | $16.7_{-4.0\%}$ | $\mathbf{16.1_{-7.5\%}}$ | 12.4 | 9.9 |
| | airport | 11.2 | 11.0 | $10.3_{-8.0\%}$ | $10.1_{-9.8\%}$ | $\mathbf{9.5_{-15.2\%}}$ | 7.9 | 4.5 |
| | exhibition | 13.2 | 13.2 | $13.2_{-0.0\%}$ | $13.0_{-1.5\%}$ | $\mathbf{12.8_{-3.0\%}}$ | 8.3 | 5.8 |
| | restaurant | 13.2 | 13.0 | $13.6_{+3.0\%}$ | $13.2_{-0.0\%}$ | $\mathbf{12.0_{-9.1\%}}$ | 8.7 | 6.2 |
| | avg. | 14.5 | 14.3 | $13.7_{-5.5\%}$ | $13.5_{-6.9\%}$ | $\mathbf{12.8_{-11.7\%}}$ | 9.2 | 6.6 |
| LS-FreeSound | metro | 9.9 | 9.8 | $9.4_{-5.1\%}$ | $9.2_{-7.1\%}$ | $\mathbf{8.2_{-17.2\%}}$ | 7.9 | 4.9 |
| | car | 4.0 | 4.0 | $3.5_{-12.5\%}$ | $3.6_{-10.0\%}$ | $\mathbf{3.3_{-17.5\%}}$ | 3.0 | 1.8 |
| | traffic | 8.3 | 8.2 | $8.3_{-0.0\%}$ | $8.3_{-0.0\%}$ | $\mathbf{8.2_{-1.2\%}}$ | 6.8 | 4.5 |
| | cafe | 9.8 | 9.5 | $9.3_{-5.1\%}$ | $9.1_{-7.1\%}$ | $\mathbf{8.5_{-13.3\%}}$ | 7.1 | 4.6 |
| | babble | 32.0 | 31.8 | $31.7_{-0.9\%}$ | $31.4_{-1.9\%}$ | $\mathbf{30.9_{-3.4\%}}$ | 28.7 | 19.3 |
| | ac/vacuum | 12.4 | 12.5 | $11.8_{-4.8\%}$ | $11.6_{-6.5\%}$ | $\mathbf{11.2_{-9.7\%}}$ | 10.2 | 6.2 |
| | avg. | 12.7 | 12.6 | $12.3_{-3.1\%}$ | $12.2_{-3.9\%}$ | $\mathbf{11.7_{-7.9\%}}$ | 10.6 | 6.9 |
| RATS | test | 45.7 | 45.6 | $45.5_{-0.4\%}$ | $45.2_{-1.1\%}$ | $\mathbf{43.6_{-4.6\%}}$ | 38.8 | 23.6 |

Response Generation. In the generation stage, we adopt a temperature of 0.2 and top-1 sampling, i.e., greedy search. We observe an over-confidence phenomenon in our experiments (i.e., the output probability distribution used for each decoding decision is close to one-hot), which leads to similar performance across different values of $k$ in top-$k$ sampling. Therefore, we select top-1 sampling for higher decoding efficiency.
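
The effect described above can be illustrated with a minimal sketch in plain Python. The logits below are hypothetical and are not taken from any model in this work; the helper names are also illustrative. When one token dominates and the temperature is low, renormalising over the top-$k$ tokens leaves almost all of the probability mass on the greedy choice, whatever $k$ is.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_distribution(probs, k):
    """Keep the k most probable tokens and renormalise; the rest get 0."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token logits in which one token clearly dominates.
logits = [6.0, 2.0, 1.0, 0.5, 0.0]
probs = softmax_with_temperature(logits, temperature=0.2)

# Index of the token that greedy search (top-1) would pick.
greedy_token = max(range(len(probs)), key=lambda i: probs[i])

# Probability of the greedy token under top-k sampling for several k.
greedy_mass = {k: top_k_distribution(probs, k)[greedy_token] for k in (1, 3, 5)}
for k, mass in greedy_mass.items():
    print(f"top-{k}: P(greedy token) = {mass:.9f}")
```

In this toy setting the sampled token is the greedy token with probability very close to one for every $k$, which is why larger $k$ adds decoding cost without changing the output.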

LM Rescoring Baseline. For the $\mathrm{LM}_{rank}$ baseline in Table 1, we use a Transformer-based LM for typical rescoring, trained on the text transcriptions of each RobustHP subset using the ESPnet toolkit (Watanabe et al., 2018; https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1). The LM contains 16 Transformer layers with 8 heads and 512 attention units, and it is trained for 25 epochs with the Adam optimizer (Kingma & Ba, 2014). The learning rate is set to $5\times 10^{-3}$ with 25,000 warm-up steps.
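
As a rough illustration of what typical rescoring does, the sketch below re-ranks an N-best list by adding a weighted LM score to each ASR score. It is a sketch under stated assumptions, not the ESPnet implementation: the interpolation weight, the toy scoring function and all names are hypothetical, and the trained Transformer LM is replaced by a stand-in.

```python
def rescore_nbest(hypotheses, asr_scores, lm_score_fn, lm_weight=0.5):
    """Return the hypothesis that maximises asr_score + lm_weight * lm_score.

    hypotheses:  list of candidate transcriptions (the N-best list)
    asr_scores:  list of ASR log-scores, one per hypothesis
    lm_score_fn: callable mapping a sentence to an LM log-score
    lm_weight:   interpolation weight (hypothetical default)
    """
    best_index = max(
        range(len(hypotheses)),
        key=lambda i: asr_scores[i] + lm_weight * lm_score_fn(hypotheses[i]),
    )
    return hypotheses[best_index]

def toy_lm_log_prob(sentence):
    """Stand-in for a trained LM: penalise words outside a tiny vocabulary."""
    known = {"the", "cat", "sat", "on", "mat"}
    return sum(0.0 if word in known else -5.0 for word in sentence.split())

# Hypothetical 3-best list where the top ASR hypothesis contains an error.
nbest = ["the cat sad on the mat", "the cat sat on the mat", "the cat sat on the mad"]
asr_scores = [-1.0, -1.2, -1.5]

best = rescore_nbest(nbest, asr_scores, toy_lm_log_prob)
print(best)
```

Note that rescoring can only select one of the existing hypotheses, so its error rate is bounded below by the N-best oracle, whereas GER generates a new transcription.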

In-context Learning Baseline. We implement an in-context learning baseline for the case study in Table 5, which is effective at exploiting the LLM's powerful reasoning ability and linguistic knowledge (Dong et al., 2022). In particular, we use ChatGPT to perform the GER task with task-activated prompting (TAP) (Yang et al., 2023a): we first prompt ChatGPT to summarize what ASR and typical LM rescoring are, then give it the definition of ASR generative error correction, followed by several examples that teach it how to perform this kind of error correction. With this background knowledge, we finally ask it to perform GER on the sample in our case study.
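
The order of the prompting steps above can be sketched as a small helper that assembles the user turns of such a conversation. The wording of every prompt, the function name and the example data are hypothetical placeholders rather than the prompts actually used, and the model's replies between turns are omitted.

```python
def build_tap_user_turns(examples, nbest):
    """Assemble user turns for a task-activating conversation.

    examples: list of (hypotheses, true_transcription) demonstration pairs
    nbest:    list of hypotheses for the sample to be corrected
    """
    def numbered(hypotheses):
        return "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))

    turns = [
        # Step 1: activate background knowledge about ASR and LM rescoring.
        "Please summarize what automatic speech recognition (ASR) is "
        "and how typical language model rescoring works.",
        # Step 2: define the generative error correction task.
        "ASR generative error correction means predicting the true "
        "transcription from the N-best hypotheses of an ASR system.",
    ]
    # Step 3: a few demonstrations of the task.
    for hypotheses, truth in examples:
        turns.append(
            f"Example N-best list:\n{numbered(hypotheses)}\n"
            f"True transcription: {truth}"
        )
    # Step 4: the actual query.
    turns.append(
        "Now report the true transcription for this N-best list:\n"
        + numbered(nbest)
    )
    return turns

# Hypothetical demonstration and query.
demo = [(["i red the book", "i read the book"], "i read the book")]
turns = build_tap_user_turns(demo, ["turn of the light", "turn off the light"])
for turn in turns:
    print(turn, end="\n\n")
```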

Details of t-SNE Visualization. Figs. 4 and 6 present the t-SNE visualization of the language and audio noise embeddings. The language embeddings are the outputs of the distillation tuner, computed on LS-FreeSound test samples. The audio embeddings are the encoder outputs of the Whisper ASR model, with speech samples also drawn from the LS-FreeSound test samples. In particular, for better visualization we employ Stable-Whisper (https://github.com/jianfch/stable-ts) to extract the speech segments of the same word ``for'' (around 5.7s in total from the LS-FreeSound test data), as the distance between different phonemes is much larger than that between different noise conditions.

Table 9: WER (%) results of RobustGER with Falcon-7b finetuning. ``$\text{LM}_{rank}$'' denotes LM rescoring. ``+ Audio Denoising'' denotes introducing audio embedding to denoise GER. $o_{nb}$ and $o_{cp}$ respectively denote the N-best oracle and compositional oracle that are defined in §5.1.

| Test Set | Condition | Baseline | $\text{LM}_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 12.2 | $7.4_{-41.3\%}$ | $7.2_{-42.9\%}$ | $\bm{6.2_{-50.8\%}}$ | 10.5 | 3.0 |
| | test-simu | 15.4 | 14.5 | $10.2_{-33.8\%}$ | $10.0_{-35.1\%}$ | $\bm{8.9_{-42.2\%}}$ | 12.9 | 5.0 |
| | dev-real | 10.6 | 10.3 | $5.8_{-45.3\%}$ | $5.5_{-48.1\%}$ | $\bm{4.8_{-54.7\%}}$ | 9.1 | 2.1 |
| | dev-simu | 12.4 | 11.9 | $7.7_{-37.9\%}$ | $7.4_{-41.7\%}$ | $\bm{6.5_{-47.6\%}}$ | 10.6 | 3.3 |
| | avg. | 12.8 | 12.2 | $7.8_{-39.1\%}$ | $7.5_{-41.4\%}$ | $\bm{6.6_{-48.4\%}}$ | 10.8 | 3.4 |
| VB-DEMAND | baby-cry | 8.0 | 7.8 | $7.2_{-10.0\%}$ | $7.0_{-12.5\%}$ | $\bm{6.7_{-16.3\%}}$ | 4.5 | 3.0 |
| | helicopter | 8.4 | 8.1 | $7.8_{-7.1\%}$ | $7.7_{-8.3\%}$ | $\bm{7.2_{-14.3\%}}$ | 4.8 | 3.2 |
| | crowd-party | 22.6 | 22.3 | $21.7_{-4.0\%}$ | $21.4_{-5.3\%}$ | $\bm{20.5_{-9.3\%}}$ | 16.5 | 11.5 |
| | avg. | 13.0 | 12.7 | $12.2_{-6.2\%}$ | $12.0_{-7.7\%}$ | $\bm{11.5_{-11.5\%}}$ | 8.6 | 5.9 |
| NOIZEUS | babble | 16.5 | 16.7 | $16.9_{+2.4\%}$ | $16.5_{-0.0\%}$ | $\bm{15.3_{-7.3\%}}$ | 9.5 | 5.8 |
| | car | 17.4 | 16.8 | $15.7_{-9.8\%}$ | $15.4_{-11.5\%}$ | $\bm{14.9_{-14.4\%}}$ | 9.9 | 7.9 |
| | station | 12.0 | 11.6 | $11.6_{-3.3\%}$ | $11.2_{-6.7\%}$ | $\bm{9.1_{-24.2\%}}$ | 6.6 | 5.0 |
| | train | 15.3 | 15.2 | $16.5_{+7.8\%}$ | $14.6_{-4.6\%}$ | $\bm{12.8_{-16.3\%}}$ | 10.3 | 7.9 |
| | street | 17.4 | 17.2 | $16.1_{-7.5\%}$ | $\bm{16.0_{-8.0\%}}$ | $16.1_{-7.5\%}$ | 12.4 | 9.9 |
| | airport | 11.2 | 11.0 | $10.7_{-4.5\%}$ | $10.6_{-5.4\%}$ | $\bm{10.3_{-8.0\%}}$ | 7.9 | 4.5 |
| | exhibition | 13.2 | 13.2 | $12.8_{-3.0\%}$ | $12.5_{-5.3\%}$ | $\bm{12.0_{-9.1\%}}$ | 8.3 | 5.8 |
| | restaurant | 13.2 | 13.0 | $12.8_{-3.0\%}$ | $12.6_{-4.5\%}$ | $\bm{12.0_{-9.1\%}}$ | 8.7 | 6.2 |
| | avg. | 14.5 | 14.3 | $14.1_{-2.8\%}$ | $13.7_{-5.5\%}$ | $\bm{12.8_{-11.7\%}}$ | 9.2 | 6.6 |
| LS-FreeSound | metro | 9.9 | 9.8 | $10.3_{+4.0\%}$ | $9.9_{-0.0\%}$ | $\bm{8.9_{-10.1\%}}$ | 7.9 | 4.9 |
| | car | 4.0 | 4.0 | $3.7_{-7.5\%}$ | $3.7_{-7.5\%}$ | $\bm{3.5_{-12.5\%}}$ | 3.0 | 1.8 |
| | traffic | 8.3 | 8.2 | $8.2_{-1.2\%}$ | $8.0_{-3.6\%}$ | $\bm{7.5_{-9.6\%}}$ | 6.8 | 4.5 |
| | cafe | 9.8 | 9.5 | $8.1_{-17.3\%}$ | $8.0_{-18.4\%}$ | $\bm{7.9_{-19.4\%}}$ | 7.1 | 4.6 |
| | babble | 32.0 | 31.8 | $31.1_{-2.8\%}$ | $30.9_{-3.4\%}$ | $\bm{30.5_{-4.7\%}}$ | 28.7 | 19.3 |
| | ac/vacuum | 12.4 | 12.5 | $12.6_{+1.6\%}$ | $12.6_{+1.6\%}$ | $\bm{12.2_{-1.6\%}}$ | 10.2 | 6.2 |
| | avg. | 12.7 | 12.6 | $12.3_{-3.1\%}$ | $12.2_{-3.9\%}$ | $\bm{11.8_{-7.1\%}}$ | 10.6 | 6.9 |
| RATS | test | 45.7 | 45.6 | $45.3_{-0.9\%}$ | $44.9_{-1.8\%}$ | $\bm{43.3_{-5.3\%}}$ | 38.8 | 23.6 |
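
The subscripted percentages in the table can be reproduced with a short calculation, assuming they denote the relative WER change with respect to the Baseline column (an assumption that is consistent with the reported numbers, e.g., the CHiME-4 test-real row). The function name below is illustrative.

```python
def relative_wer_change(baseline_wer, system_wer):
    """Relative WER change in percent; negative values mean fewer errors."""
    return 100.0 * (system_wer - baseline_wer) / baseline_wer

# CHiME-4 test-real row of Table 9 (Falcon-7b).
baseline = 12.6
systems = {"GER": 7.4, "+ Audio Denoising": 7.2, "RobustGER": 6.2}

changes = {name: round(relative_wer_change(baseline, wer), 1)
           for name, wer in systems.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```
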
Table 10: WER (%) results of RobustGER with LLaMA-2-13b finetuning. ``$\text{LM}_{rank}$'' denotes LM rescoring. ``+ Audio Denoising'' denotes introducing audio embedding to denoise GER. $o_{nb}$ and $o_{cp}$ respectively denote the N-best oracle and compositional oracle that are defined in §5.1.

| Test Set | Condition | Baseline | $\text{LM}_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 12.2 | $5.5_{-56.3\%}$ | $5.3_{-57.9\%}$ | $\bm{4.9_{-61.1\%}}$ | 10.5 | 3.0 |
| | test-simu | 15.4 | 14.5 | $8.1_{-47.4\%}$ | $8.2_{-46.8\%}$ | $\bm{7.9_{-48.7\%}}$ | 12.9 | 5.0 |
| | dev-real | 10.6 | 10.3 | $4.1_{-61.3\%}$ | $3.8_{-64.2\%}$ | $\bm{3.3_{-68.9\%}}$ | 9.1 | 2.1 |
| | dev-simu | 12.4 | 11.9 | $6.1_{-50.8\%}$ | $5.9_{-52.4\%}$ | $\bm{5.1_{-58.9\%}}$ | 10.6 | 3.3 |
| | avg. | 12.8 | 12.2 | $6.0_{-53.1\%}$ | $5.8_{-54.7\%}$ | $\bm{5.3_{-58.6\%}}$ | 10.8 | 3.4 |
| VB-DEMAND | baby-cry | 8.0 | 7.8 | $6.7_{-16.3\%}$ | $6.6_{-17.5\%}$ | $\bm{6.0_{-25.0\%}}$ | 4.5 | 3.0 |
| | helicopter | 8.4 | 8.1 | $7.2_{-14.3\%}$ | $7.0_{-16.7\%}$ | $\bm{6.5_{-22.6\%}}$ | 4.8 | 3.2 |
| | crowd-party | 22.6 | 22.3 | $21.0_{-7.1\%}$ | $20.6_{-8.8\%}$ | $\bm{19.6_{-13.3\%}}$ | 16.5 | 11.5 |
| | avg. | 13.0 | 12.7 | $11.6_{-10.8\%}$ | $11.4_{-12.3\%}$ | $\bm{10.7_{-17.7\%}}$ | 8.6 | 5.9 |
| NOIZEUS | babble | 16.5 | 16.7 | $15.3_{-7.3\%}$ | $\bm{15.2_{-7.9\%}}$ | $15.3_{-7.3\%}$ | 9.5 | 5.8 |
car 17.417.417.417.4 16.816.816.816.8 14.914.4%subscript14.9percent14.414.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-14.4\%}}14.9 start_POSTSUBSCRIPT - 14.4 % end_POSTSUBSCRIPT 14.715.5%subscript14.7percent15.514.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.5\%}}14.7 start_POSTSUBSCRIPT - 15.5 % end_POSTSUBSCRIPT 14.019.5%subscript14.0percent19.5\bm{14.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 9.5\%}}}bold_14.0 start_POSTSUBSCRIPT bold_- bold_19.5 bold_% end_POSTSUBSCRIPT 9.99.99.99.9 7.97.97.97.9
station 12.012.012.012.0 11.611.611.611.6 9.520.8%subscript9.5percent20.89.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-20.8\%}}9.5 start_POSTSUBSCRIPT - 20.8 % end_POSTSUBSCRIPT 9.421.7%subscript9.4percent21.79.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-21.7\%}}9.4 start_POSTSUBSCRIPT - 21.7 % end_POSTSUBSCRIPT 9.124.2%subscript9.1percent24.2\bm{9.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-24.2\%}}}bold_9.1 start_POSTSUBSCRIPT bold_- bold_24.2 bold_% end_POSTSUBSCRIPT 6.66.66.66.6 5.05.05.05.0
train 15.315.315.315.3 15.215.215.215.2 15.30.0%subscript15.3percent0.015.3_{{\color[rgb]{.5,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,.5,.5}% \pgfsys@color@gray@stroke{.5}\pgfsys@color@gray@fill{.5}-0.0\%}}15.3 start_POSTSUBSCRIPT - 0.0 % end_POSTSUBSCRIPT 14.73.9%subscript14.7percent3.914.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-3.9\%}}14.7 start_POSTSUBSCRIPT - 3.9 % end_POSTSUBSCRIPT 12.816.3%subscript12.8percent16.3\bm{12.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 6.3\%}}}bold_12.8 start_POSTSUBSCRIPT bold_- bold_16.3 bold_% end_POSTSUBSCRIPT 10.310.310.310.3 7.97.97.97.9
street 17.417.417.417.4 17.217.217.217.2 16.92.9%subscript16.9percent2.9\bm{16.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2% .9\%}}}bold_16.9 start_POSTSUBSCRIPT bold_- bold_2.9 bold_% end_POSTSUBSCRIPT 16.92.9%subscript16.9percent2.9\bm{16.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2% .9\%}}}bold_16.9 start_POSTSUBSCRIPT bold_- bold_2.9 bold_% end_POSTSUBSCRIPT 16.92.9%subscript16.9percent2.9\bm{16.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2% .9\%}}}bold_16.9 start_POSTSUBSCRIPT bold_- bold_2.9 bold_% end_POSTSUBSCRIPT 12.412.412.412.4 9.99.99.99.9
airport 11.211.211.211.2 11.011.011.011.0 10.74.5%subscript10.7percent4.510.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-4.5\%}}10.7 start_POSTSUBSCRIPT - 4.5 % end_POSTSUBSCRIPT 10.38.0%subscript10.3percent8.010.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.0\%}}10.3 start_POSTSUBSCRIPT - 8.0 % end_POSTSUBSCRIPT 8.722.3%subscript8.7percent22.3\bm{8.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-22.3\%}}}bold_8.7 start_POSTSUBSCRIPT bold_- bold_22.3 bold_% end_POSTSUBSCRIPT 7.97.97.97.9 4.54.54.54.5
exhibition 13.213.213.213.2 13.213.213.213.2 12.09.1%subscript12.0percent9.112.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-9.1\%}}12.0 start_POSTSUBSCRIPT - 9.1 % end_POSTSUBSCRIPT 11.612.1%subscript11.6percent12.111.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.1\%}}11.6 start_POSTSUBSCRIPT - 12.1 % end_POSTSUBSCRIPT 10.718.9%subscript10.7percent18.9\bm{10.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 8.9\%}}}bold_10.7 start_POSTSUBSCRIPT bold_- bold_18.9 bold_% end_POSTSUBSCRIPT 8.38.38.38.3 5.85.85.85.8
restaurant 13.213.213.213.2 13.013.013.013.0 12.46.1%subscript12.4percent6.112.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-6.1\%}}12.4 start_POSTSUBSCRIPT - 6.1 % end_POSTSUBSCRIPT 12.18.3%subscript12.1percent8.312.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.3\%}}12.1 start_POSTSUBSCRIPT - 8.3 % end_POSTSUBSCRIPT 10.322.0%subscript10.3percent22.0\bm{10.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2% 2.0\%}}}bold_10.3 start_POSTSUBSCRIPT bold_- bold_22.0 bold_% end_POSTSUBSCRIPT 8.78.78.78.7 6.26.26.26.2
avg. 14.514.514.514.5 14.314.314.314.3 13.47.6%subscript13.4percent7.613.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-7.6\%}}13.4 start_POSTSUBSCRIPT - 7.6 % end_POSTSUBSCRIPT 13.19.7%subscript13.1percent9.713.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-9.7\%}}13.1 start_POSTSUBSCRIPT - 9.7 % end_POSTSUBSCRIPT 12.215.9%subscript12.2percent15.9\bm{12.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 5.9\%}}}bold_12.2 start_POSTSUBSCRIPT bold_- bold_15.9 bold_% end_POSTSUBSCRIPT 9.29.29.29.2 6.66.66.66.6
LS-FreeSound metro 9.99.99.99.9 9.89.89.89.8 9.72.0%subscript9.7percent2.09.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.0\%}}9.7 start_POSTSUBSCRIPT - 2.0 % end_POSTSUBSCRIPT 9.45.1%subscript9.4percent5.19.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-5.1\%}}9.4 start_POSTSUBSCRIPT - 5.1 % end_POSTSUBSCRIPT 8.613.1%subscript8.6percent13.1\bm{8.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-13.1\%}}}bold_8.6 start_POSTSUBSCRIPT bold_- bold_13.1 bold_% end_POSTSUBSCRIPT 7.97.97.97.9 4.94.94.94.9
car 4.04.04.04.0 4.04.04.04.0 3.77.5%subscript3.7percent7.53.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-7.5\%}}3.7 start_POSTSUBSCRIPT - 7.5 % end_POSTSUBSCRIPT 3.85.0%subscript3.8percent5.03.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-5.0\%}}3.8 start_POSTSUBSCRIPT - 5.0 % end_POSTSUBSCRIPT 3.512.5%subscript3.5percent12.5\bm{3.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.5\%}}}bold_3.5 start_POSTSUBSCRIPT bold_- bold_12.5 bold_% end_POSTSUBSCRIPT 3.03.03.03.0 1.81.81.81.8
traffic 8.38.38.38.3 8.28.28.28.2 8.30.0%subscript8.3percent0.08.3_{{\color[rgb]{.5,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,.5,.5}% \pgfsys@color@gray@stroke{.5}\pgfsys@color@gray@fill{.5}-0.0\%}}8.3 start_POSTSUBSCRIPT - 0.0 % end_POSTSUBSCRIPT 8.21.2%subscript8.2percent1.28.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1.2\%}}8.2 start_POSTSUBSCRIPT - 1.2 % end_POSTSUBSCRIPT 7.68.4%subscript7.6percent8.4\bm{7.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.4\%}}}bold_7.6 start_POSTSUBSCRIPT bold_- bold_8.4 bold_% end_POSTSUBSCRIPT 6.86.86.86.8 4.54.54.54.5
cafe 9.89.89.89.8 9.59.59.59.5 8.711.2%subscript8.7percent11.28.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-11.2\%}}8.7 start_POSTSUBSCRIPT - 11.2 % end_POSTSUBSCRIPT 8.513.3%subscript8.5percent13.38.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-13.3\%}}8.5 start_POSTSUBSCRIPT - 13.3 % end_POSTSUBSCRIPT 7.523.5%subscript7.5percent23.5\bm{7.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-23.5\%}}}bold_7.5 start_POSTSUBSCRIPT bold_- bold_23.5 bold_% end_POSTSUBSCRIPT 7.17.17.17.1 4.64.64.64.6
babble 32.032.032.032.0 31.831.831.831.8 31.80.6%subscript31.8percent0.631.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-0.6\%}}31.8 start_POSTSUBSCRIPT - 0.6 % end_POSTSUBSCRIPT 31.61.3%subscript31.6percent1.331.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1.3\%}}31.6 start_POSTSUBSCRIPT - 1.3 % end_POSTSUBSCRIPT 30.83.8%subscript30.8percent3.8\bm{30.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-3% .8\%}}}bold_30.8 start_POSTSUBSCRIPT bold_- bold_3.8 bold_% end_POSTSUBSCRIPT 28.728.728.728.7 19.319.319.319.3
ac/vacuum 12.412.412.412.4 12.512.512.512.5 11.57.3%subscript11.5percent7.311.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-7.3\%}}11.5 start_POSTSUBSCRIPT - 7.3 % end_POSTSUBSCRIPT 11.48.1%subscript11.4percent8.111.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.1\%}}11.4 start_POSTSUBSCRIPT - 8.1 % end_POSTSUBSCRIPT 11.011.3%subscript11.0percent11.3\bm{11.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 1.3\%}}}bold_11.0 start_POSTSUBSCRIPT bold_- bold_11.3 bold_% end_POSTSUBSCRIPT 10.210.210.210.2 6.26.26.26.2
avg. 12.712.712.712.7 12.612.612.612.6 12.33.1%subscript12.3percent3.112.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-3.1\%}}12.3 start_POSTSUBSCRIPT - 3.1 % end_POSTSUBSCRIPT 12.23.9%subscript12.2percent3.912.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-3.9\%}}12.2 start_POSTSUBSCRIPT - 3.9 % end_POSTSUBSCRIPT 11.59.4%subscript11.5percent9.4\bm{11.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-9% .4\%}}}bold_11.5 start_POSTSUBSCRIPT bold_- bold_9.4 bold_% end_POSTSUBSCRIPT 10.610.610.610.6 6.96.96.96.9
RATS test 45.745.745.745.7 45.645.645.645.6 44.42.8%subscript44.4percent2.844.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.8\%}}44.4 start_POSTSUBSCRIPT - 2.8 % end_POSTSUBSCRIPT 44.03.7%subscript44.0percent3.744.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-3.7\%}}44.0 start_POSTSUBSCRIPT - 3.7 % end_POSTSUBSCRIPT 43.05.9%subscript43.0percent5.9\bm{43.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-5% .9\%}}}bold_43.0 start_POSTSUBSCRIPT bold_- bold_5.9 bold_% end_POSTSUBSCRIPT 38.838.838.838.8 23.623.623.623.6
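
The "avg." rows above appear consistent with an unweighted mean over the noise types of each test set, rounded to one decimal. The following minimal sketch checks this for the Baseline column; the dictionary layout and the helper name `unweighted_avg` are illustrative assumptions, and the paper does not state whether averages are weighted by utterance count.

```python
from statistics import mean

# Baseline WER (%) per noise type, copied from the table rows above.
baseline_wer = {
    "VB-DEMAND": [8.0, 8.4, 22.6],
    "NOIZEUS": [16.5, 17.4, 12.0, 15.3, 17.4, 11.2, 13.2, 13.2],
    "LS-FreeSound": [9.9, 4.0, 8.3, 9.8, 32.0, 12.4],
}


def unweighted_avg(values):
    """Unweighted mean over noise types, rounded to one decimal place."""
    return round(mean(values), 1)


# Hypothetical helper usage: one average per test set.
averages = {name: unweighted_avg(vals) for name, vals in baseline_wer.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
```

If the averages were instead weighted by the number of utterances per noise type, the per-condition utterance counts would be needed, which this table does not provide.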

Appendix D Supplementary Experiments

D.1 Results on Different LLMs

Apart from LLaMA-2-7b, we also evaluate our proposed RobustGER approach on the popular LLaMA-7b and Falcon-7b models, as shown in Tables 8 and 9. To further investigate the effect of LLM size on RobustGER, we conduct additional experiments on LLaMA-2-13b, reported in Table 10.

Similar to the results of LLaMA-2-7b in Table 1, our proposed RobustGER achieves consistent performance gains across various LLMs and testing conditions, which verifies its general effectiveness. On the other hand, performance differs somewhat between LLMs. In particular, LLaMA-2-13b outperforms all the 7b LLMs due to its larger model capacity and stronger language generation ability. Among the 7b models, LLaMA-2-7b outperforms LLaMA-7b and Falcon-7b thanks to its larger-scale training data and longer context length.
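
The percentages attached to the GER, + Audio Denoising, and RobustGER entries in these tables appear to be relative WER reductions with respect to the Baseline column. As a minimal sketch under that assumption, the snippet below recomputes them for the VB-DEMAND "avg." row of the preceding table; the function name `relative_wer_reduction` is hypothetical and not taken from the released code.

```python
def relative_wer_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction (%) of `system` with respect to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline - system) / baseline


# WER (%) values copied from the VB-DEMAND "avg." row.
baseline = 13.0
systems = {
    "GER": 11.6,
    "+ Audio Denoising": 11.4,
    "RobustGER": 10.7,
}

# Map each system to its relative reduction over the baseline.
reductions = {
    name: relative_wer_reduction(baseline, wer) for name, wer in systems.items()
}

for name, red in reductions.items():
    print(f"{name}: -{red:.1f}%")
```

Because the table entries are rounded to one decimal, recomputed reductions can differ from the printed subscripts in the last digit for some rows.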

Table 11: WER (%) results of RobustGER under different SNR-level testing conditions. The test sets are from the LS-FreeSound dataset, with five SNR levels (i.e., {0, 5, 10, 15, 20} dB) on six noise types (i.e., ``Metro'', ``Car'', ``Traffic'', ``Cafe'', ``Babble'', and ``AC/Vacuum'').
| Noise Type | SNR (dB) | Baseline | LM$_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|---|
| Metro | 0 | 9.9 | 9.8 | $9.5_{-4.0\%}$ | $9.4_{-5.1\%}$ | $\bm{8.9_{-10.1\%}}$ | 7.9 | 4.9 |
| | 5 | 7.2 | 7.0 | $6.7_{-6.9\%}$ | $6.4_{-11.1\%}$ | $\bm{5.5_{-23.6\%}}$ | 5.5 | 3.2 |
| | 10 | 4.8 | 4.6 | $4.2_{-12.5\%}$ | $4.3_{-10.4\%}$ | $\bm{4.0_{-16.7\%}}$ | 3.9 | 2.3 |
| | 15 | 3.9 | 3.5 | $3.2_{-17.9\%}$ | $3.2_{-17.9\%}$ | $\bm{3.0_{-23.1\%}}$ | 3.1 | 1.7 |
| | 20 | 3.3 | 3.1 | $2.7_{-18.2\%}$ | $2.6_{-21.2\%}$ | $\bm{2.3_{-30.3\%}}$ | 2.6 | 1.3 |
| | avg. | 5.8 | 5.6 | $5.3_{-8.6\%}$ | $5.2_{-10.3\%}$ | $\bm{4.7_{-19.0\%}}$ | 4.6 | 2.7 |
| Car | 0 | 4.0 | 4.0 | $3.7_{-7.5\%}$ | $3.5_{-12.5\%}$ | $\bm{3.1_{-22.5\%}}$ | 3.0 | 1.8 |
| | 5 | 3.8 | 3.5 | $3.1_{-18.4\%}$ | $3.1_{-18.4\%}$ | $\bm{2.8_{-26.3\%}}$ | 2.8 | 1.5 |
| | 10 | 3.2 | 3.3 | $3.2_{-0.0\%}$ | $3.0_{-6.3\%}$ | $\bm{2.2_{-31.3\%}}$ | 2.4 | 1.4 |
| | 15 | 2.8 | 2.7 | $2.5_{-10.7\%}$ | $2.5_{-10.7\%}$ | $\bm{2.3_{-17.9\%}}$ | 2.4 | 1.4 |
| | 20 | 3.1 | 2.8 | $2.5_{-19.4\%}$ | $2.4_{-22.6\%}$ | $\bm{2.1_{-32.3\%}}$ | 2.4 | 1.4 |
| | avg. | 3.4 | 3.3 | $3.0_{-11.8\%}$ | $2.9_{-14.7\%}$ | $\bm{2.5_{-26.5\%}}$ | 2.6 | 1.5 |
| Traffic | 0 | 8.3 | 8.2 | $8.0_{-3.6\%}$ | $7.8_{-6.0\%}$ | $\bm{7.5_{-9.6\%}}$ | 6.8 | 4.5 |
| | 5 | 6.3 | 6.1 | $5.6_{-11.1\%}$ | $5.5_{-12.7\%}$ | $\bm{4.9_{-22.2\%}}$ | 4.9 | 3.2 |
| | 10 | 3.8 | 3.6 | $3.3_{-13.2\%}$ | $3.3_{-13.2\%}$ | $\bm{3.2_{-15.8\%}}$ | 3.2 | 1.9 |
| | 15 | 3.4 | 3.1 | $2.9_{-14.7\%}$ | $2.8_{-17.6\%}$ | $\bm{2.4_{-29.4\%}}$ | 2.8 | 1.7 |
| | 20 | 3.7 | 3.5 | $3.4_{-8.1\%}$ | $3.3_{-10.8\%}$ | $\bm{3.0_{-18.9\%}}$ | 2.9 | 1.7 |
| | avg. | 5.1 | 4.9 | $4.6_{-9.8\%}$ | $4.5_{-11.8\%}$ | $\bm{4.2_{-17.6\%}}$ | 4.1 | 2.6 |
| Cafe | 0 | 9.8 | 9.5 | $8.1_{-17.3\%}$ | $8.1_{-17.3\%}$ | $\bm{7.5_{-23.5\%}}$ | 7.1 | 4.6 |
| | 5 | 5.7 | 5.7 | $5.4_{-5.3\%}$ | $5.6_{-1.8\%}$ | $\bm{5.3_{-7.0\%}}$ | 4.5 | 2.6 |
| | 10 | 5.0 | 4.7 | $4.5_{-10.0\%}$ | $4.2_{-16.0\%}$ | $\bm{4.0_{-20.0\%}}$ | 3.8 | 2.2 |
| | 15 | 3.6 | 3.5 | $3.3_{-8.3\%}$ | $3.2_{-11.1\%}$ | $\bm{3.0_{-16.7\%}}$ | 2.7 | 1.5 |
| | 20 | 3.5 | 3.2 | $2.7_{-22.9\%}$ | $2.9_{-17.1\%}$ | $\bm{2.9_{-17.1\%}}$ | 2.6 | 1.5 |
avg. 5.55.55.55.5 5.35.35.35.3 4.812.7%subscript4.8percent12.74.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.7\%}}4.8 start_POSTSUBSCRIPT - 12.7 % end_POSTSUBSCRIPT 4.812.7%subscript4.8percent12.74.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.7\%}}4.8 start_POSTSUBSCRIPT - 12.7 % end_POSTSUBSCRIPT 4.518.2%subscript4.5percent18.2\bm{4.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-18.2\%}}}bold_4.5 start_POSTSUBSCRIPT bold_- bold_18.2 bold_% end_POSTSUBSCRIPT 4.14.14.14.1 2.52.52.52.5
Babble 0 32.032.032.032.0 31.831.831.831.8 31.32.2%subscript31.3percent2.231.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.2\%}}31.3 start_POSTSUBSCRIPT - 2.2 % end_POSTSUBSCRIPT 31.61.3%subscript31.6percent1.331.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1.3\%}}31.6 start_POSTSUBSCRIPT - 1.3 % end_POSTSUBSCRIPT 31.12.8%subscript31.1percent2.8\bm{31.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2% .8\%}}}bold_31.1 start_POSTSUBSCRIPT bold_- bold_2.8 bold_% end_POSTSUBSCRIPT 28.728.728.728.7 19.319.319.319.3
5 17.017.017.017.0 16.816.816.816.8 17.00.0%subscript17.0percent0.017.0_{{\color[rgb]{.5,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{.5,.5,.5}% \pgfsys@color@gray@stroke{.5}\pgfsys@color@gray@fill{.5}-0.0\%}}17.0 start_POSTSUBSCRIPT - 0.0 % end_POSTSUBSCRIPT 16.62.4%subscript16.6percent2.416.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.4\%}}16.6 start_POSTSUBSCRIPT - 2.4 % end_POSTSUBSCRIPT 16.34.1%subscript16.3percent4.1\bm{16.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-4% .1\%}}}bold_16.3 start_POSTSUBSCRIPT bold_- bold_4.1 bold_% end_POSTSUBSCRIPT 13.913.913.913.9 9.29.29.29.2
10 8.88.88.88.8 9.09.09.09.0 8.62.3%subscript8.6percent2.38.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.3\%}}8.6 start_POSTSUBSCRIPT - 2.3 % end_POSTSUBSCRIPT 8.44.5%subscript8.4percent4.58.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-4.5\%}}8.4 start_POSTSUBSCRIPT - 4.5 % end_POSTSUBSCRIPT 8.18.0%subscript8.1percent8.0\bm{8.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.0\%}}}bold_8.1 start_POSTSUBSCRIPT bold_- bold_8.0 bold_% end_POSTSUBSCRIPT 6.56.56.56.5 3.93.93.93.9
15 6.56.56.56.5 6.16.16.16.1 5.810.8%subscript5.8percent10.85.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-10.8\%}}5.8 start_POSTSUBSCRIPT - 10.8 % end_POSTSUBSCRIPT 5.712.3%subscript5.7percent12.35.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.3\%}}5.7 start_POSTSUBSCRIPT - 12.3 % end_POSTSUBSCRIPT 5.416.9%subscript5.4percent16.9\bm{5.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-16.9\%}}}bold_5.4 start_POSTSUBSCRIPT bold_- bold_16.9 bold_% end_POSTSUBSCRIPT 4.74.74.74.7 3.03.03.03.0
20 10.510.510.510.5 10.110.110.110.1 7.627.6%subscript7.6percent27.67.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-27.6\%}}7.6 start_POSTSUBSCRIPT - 27.6 % end_POSTSUBSCRIPT 7.627.6%subscript7.6percent27.67.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-27.6\%}}7.6 start_POSTSUBSCRIPT - 27.6 % end_POSTSUBSCRIPT 7.627.6%subscript7.6percent27.6\bm{7.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-27.6\%}}}bold_7.6 start_POSTSUBSCRIPT bold_- bold_27.6 bold_% end_POSTSUBSCRIPT 9.69.69.69.6 2.02.02.02.0
avg. 15.015.015.015.0 14.814.814.814.8 14.16.0%subscript14.1percent6.014.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-6.0\%}}14.1 start_POSTSUBSCRIPT - 6.0 % end_POSTSUBSCRIPT 14.06.7%subscript14.0percent6.714.0_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-6.7\%}}14.0 start_POSTSUBSCRIPT - 6.7 % end_POSTSUBSCRIPT 13.78.7%subscript13.7percent8.7\bm{13.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8% .7\%}}}bold_13.7 start_POSTSUBSCRIPT bold_- bold_8.7 bold_% end_POSTSUBSCRIPT 12.712.712.712.7 7.57.57.57.5
AC/Vacuum 0 12.412.412.412.4 12.512.512.512.5 12.30.8%subscript12.3percent0.812.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-0.8\%}}12.3 start_POSTSUBSCRIPT - 0.8 % end_POSTSUBSCRIPT 12.12.4%subscript12.1percent2.412.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-2.4\%}}12.1 start_POSTSUBSCRIPT - 2.4 % end_POSTSUBSCRIPT 11.48.1%subscript11.4percent8.1\bm{11.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8% .1\%}}}bold_11.4 start_POSTSUBSCRIPT bold_- bold_8.1 bold_% end_POSTSUBSCRIPT 10.210.210.210.2 6.26.26.26.2
5 7.47.47.47.4 7.07.07.07.0 6.512.2%subscript6.5percent12.26.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-12.2\%}}6.5 start_POSTSUBSCRIPT - 12.2 % end_POSTSUBSCRIPT 6.314.9%subscript6.3percent14.96.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-14.9\%}}6.3 start_POSTSUBSCRIPT - 14.9 % end_POSTSUBSCRIPT 5.821.6%subscript5.8percent21.6\bm{5.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-21.6\%}}}bold_5.8 start_POSTSUBSCRIPT bold_- bold_21.6 bold_% end_POSTSUBSCRIPT 5.55.55.55.5 3.13.13.13.1
10 6.66.66.66.6 6.26.26.26.2 5.516.7%subscript5.5percent16.75.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-16.7\%}}5.5 start_POSTSUBSCRIPT - 16.7 % end_POSTSUBSCRIPT 5.615.2%subscript5.6percent15.25.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.2\%}}5.6 start_POSTSUBSCRIPT - 15.2 % end_POSTSUBSCRIPT 5.516.7%subscript5.5percent16.7\bm{5.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-16.7\%}}}bold_5.5 start_POSTSUBSCRIPT bold_- bold_16.7 bold_% end_POSTSUBSCRIPT 4.54.54.54.5 2.62.62.62.6
15 4.44.44.44.4 4.24.24.24.2 3.715.9%subscript3.7percent15.93.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.9\%}}3.7 start_POSTSUBSCRIPT - 15.9 % end_POSTSUBSCRIPT 3.715.9%subscript3.7percent15.93.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.9\%}}3.7 start_POSTSUBSCRIPT - 15.9 % end_POSTSUBSCRIPT 3.618.2%subscript3.6percent18.2\bm{3.6_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-18.2\%}}}bold_3.6 start_POSTSUBSCRIPT bold_- bold_18.2 bold_% end_POSTSUBSCRIPT 3.33.33.33.3 1.81.81.81.8
20 3.83.83.83.8 3.73.73.73.7 3.313.2%subscript3.3percent13.23.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-13.2\%}}3.3 start_POSTSUBSCRIPT - 13.2 % end_POSTSUBSCRIPT 3.215.8%subscript3.2percent15.83.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.8\%}}3.2 start_POSTSUBSCRIPT - 15.8 % end_POSTSUBSCRIPT 2.923.7%subscript2.9percent23.7\bm{2.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-23.7\%}}}bold_2.9 start_POSTSUBSCRIPT bold_- bold_23.7 bold_% end_POSTSUBSCRIPT 2.82.82.82.8 1.41.41.41.4
avg. 6.96.96.96.9 6.76.76.76.7 6.38.7%subscript6.3percent8.76.3_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.7\%}}6.3 start_POSTSUBSCRIPT - 8.7 % end_POSTSUBSCRIPT 6.210.1%subscript6.2percent10.16.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-10.1\%}}6.2 start_POSTSUBSCRIPT - 10.1 % end_POSTSUBSCRIPT 5.815.9%subscript5.8percent15.9\bm{5.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.9\%}}}bold_5.8 start_POSTSUBSCRIPT bold_- bold_15.9 bold_% end_POSTSUBSCRIPT 5.35.35.35.3 3.03.03.03.0
Clean \infty 3.03.03.03.0 2.82.82.82.8 2.516.7%subscript2.5percent16.72.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-16.7\%}}2.5 start_POSTSUBSCRIPT - 16.7 % end_POSTSUBSCRIPT 2.420.0%subscript2.4percent20.02.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-20.0\%}}2.4 start_POSTSUBSCRIPT - 20.0 % end_POSTSUBSCRIPT 2.130.0%subscript2.1percent30.0\bm{2.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-30.0\%}}}bold_2.1 start_POSTSUBSCRIPT bold_- bold_30.0 bold_% end_POSTSUBSCRIPT 2.52.52.52.5 1.41.41.41.4
Table 12: WER (%) results of RobustGER on clean test data from VB-DEMAND and LS-FreeSound.
| Test set | Baseline | LM$_{rank}$ | GER | + Audio Denoising | RobustGER (ours) | Oracle $o_{nb}$ | Oracle $o_{cp}$ |
|---|---|---|---|---|---|---|---|
| VB-DEMAND | 1.3 | 1.5 | 1.3 (-0.0%) | 1.2 (-7.7%) | **0.7 (-46.2%)** | 0.6 | 0.3 |
| LS-FreeSound | 3.0 | 2.8 | 2.5 (-16.7%) | 2.4 (-20.0%) | **2.1 (-30.0%)** | 2.5 | 1.4 |

D.2 Results on Different SNRs

Table 11 reports additional results under different SNR testing conditions. As in Table 2, we observe consistent performance gains of RobustGER over the vanilla GER and audio denoising baselines across noise levels ranging from 0 dB (quite noisy) to 20 dB (relatively clean). In addition, RobustGER surpasses the reranking upper bound $o_{nb}$ in some testing scenarios, indicating its advantage over conventional LM rescoring methods.
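
To make the comparison with $o_{nb}$ concrete, the following minimal Python sketch computes an N-best oracle WER for a single utterance. It assumes $o_{nb}$ is the WER of the best hypothesis in the N-best list, i.e., the lowest error any reranker could reach. The function names and the toy utterance are hypothetical and are not taken from the RobustGER code. The WER here is per utterance, whereas reported numbers are normally aggregated over a test set.

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Utterance-level word error rate in percent."""
    ref_words = ref.split()
    return 100.0 * word_edit_distance(ref_words, hyp.split()) / len(ref_words)


def nbest_oracle_wer(ref: str, nbest: list[str]) -> float:
    """WER of the best single hypothesis: the floor for any reranking method."""
    return min(wer(ref, hyp) for hyp in nbest)


# Hypothetical toy utterance (illustration only, not from the benchmark).
reference = "turn the light off"
nbest_list = [
    "turn the light of",
    "turn delight off",
    "turn the lights of",
]
oracle = nbest_oracle_wer(reference, nbest_list)
```

In this toy case no single hypothesis is fully correct, so reranking cannot go below 25% WER. A generative corrector, by contrast, may combine "turn the light" from the first hypothesis with "off" from the second and recover the reference, which is one way a GER system can fall below $o_{nb}$.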

Furthermore, we report error correction results on clean test data from the VB-DEMAND and LS-FreeSound datasets (Table 12), where RobustGER achieves significant relative WER reductions of 46.2% and 30.0%, respectively. This experimental evidence demonstrates the excellent generality of RobustGER across various ASR scenarios.
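
The relative reductions quoted above can be checked directly against the Table 12 entries. The sketch below assumes the subscripted percentages are computed against the Baseline column as (baseline - system) / baseline; the helper name is illustrative only.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction over the baseline, in percent (one decimal)."""
    return round(100.0 * (baseline_wer - system_wer) / baseline_wer, 1)


# (Baseline WER, RobustGER WER) on clean test data, copied from Table 12.
clean_results = {
    "VB-DEMAND": (1.3, 0.7),
    "LS-FreeSound": (3.0, 2.1),
}

reductions = {
    test_set: relative_wer_reduction(baseline, robust)
    for test_set, (baseline, robust) in clean_results.items()
}
print(reductions)
```

Under this assumption the helper reproduces the 46.2% and 30.0% figures, which suggests that every subscript in the tables is measured relative to the Baseline column rather than to the preceding system.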

Table 13: Ablation study of the language-space noise embedding in terms of text embedding extractor. "LLaMA Emb." denotes the input embedding layer of LLaMA-2-7b model.
| Test Set | | Baseline | GER | + Audio Denoising | + Language Denoising (LLaMA Emb.) | + Language Denoising (FastText) | + Language Denoising (SBERT) |
|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 6.5 (-48.4%) | 6.4 (-49.2%) | 6.6 (-47.6%) | 6.2 (-50.8%) | **5.9 (-53.2%)** |
| | test-simu | 15.4 | 9.2 (-40.3%) | 9.0 (-41.6%) | 8.9 (-42.2%) | 8.7 (-43.5%) | **8.6 (-44.2%)** |
| | dev-real | 10.6 | 5.0 (-52.8%) | 4.9 (-53.8%) | 4.9 (-53.8%) | 4.5 (-57.5%) | **4.4 (-58.5%)** |
| | dev-simu | 12.4 | 6.8 (-45.2%) | 6.6 (-46.8%) | 6.7 (-46.0%) | 6.4 (-48.4%) | **6.1 (-50.8%)** |
| | avg. | 12.8 | 6.9 (-46.1%) | 6.7 (-47.7%) | 6.8 (-46.9%) | 6.5 (-49.2%) | **6.3 (-50.8%)** |
| VB-DEMAND | baby-cry | 8.0 | 7.0 (-12.5%) | 6.9 (-13.8%) | 6.8 (-15.0%) | 6.5 (-18.8%) | **6.4 (-20.0%)** |
| | helicopter | 8.4 | 7.4 (-11.9%) | 7.3 (-13.1%) | 7.5 (-10.7%) | 7.4 (-11.9%) | **7.1 (-15.5%)** |
| | crowd-party | 22.6 | 21.4 (-5.3%) | 21.0 (-7.1%) | 20.9 (-7.5%) | 20.3 (-10.2%) | **19.9 (-11.9%)** |
| | avg. | 13.0 | 11.9 (-8.5%) | 11.7 (-10.0%) | 11.7 (-10.0%) | 11.4 (-12.3%) | **11.1 (-14.6%)** |
Table 14: Comparison of different techniques for audio noise distillation. "T-S learning" denotes teacher-student learning with KL regularization, "Contra. learning" denotes contrastive learning.
| Test Set | | Baseline | GER | + Lang. Denoising | + Audio Noise Distillation (T-S learning) | + Audio Noise Distillation (Contra. learning) | + Audio Noise Distillation (MINE) |
|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | 12.6 | 6.5 (-48.4%) | 5.9 (-53.2%) | 5.9 (-53.2%) | 5.8 (-54.0%) | **5.6 (-55.6%)** |
| | test-simu | 15.4 | 9.2 (-40.3%) | 8.6 (-44.2%) | 8.7 (-43.5%) | 8.4 (-45.5%) | **8.2 (-46.8%)** |
| | dev-real | 10.6 | 5.0 (-52.8%) | 4.4 (-58.5%) | 4.5 (-57.5%) | 4.2 (-60.4%) | **4.1 (-61.3%)** |
| | dev-simu | 12.4 | 6.8 (-45.2%) | 6.1 (-50.8%) | 6.0 (-51.6%) | 6.1 (-50.8%) | **5.8 (-53.2%)** |
| | avg. | 12.8 | 6.9 (-46.1%) | 6.3 (-50.8%) | 6.3 (-50.8%) | 6.1 (-52.3%) | **5.9 (-53.9%)** |
| VB-DEMAND | baby-cry | 8.0 | 7.0 (-12.5%) | 6.4 (-20.0%) | 6.4 (-20.0%) | 6.2 (-22.5%) | **6.0 (-25.0%)** |
helicopter 8.48.48.48.4 7.411.9%subscript7.4percent11.97.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-11.9\%}}7.4 start_POSTSUBSCRIPT - 11.9 % end_POSTSUBSCRIPT 7.115.5%subscript7.1percent15.57.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-15.5\%}}7.1 start_POSTSUBSCRIPT - 15.5 % end_POSTSUBSCRIPT 7.214.3%subscript7.2percent14.37.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-14.3\%}}7.2 start_POSTSUBSCRIPT - 14.3 % end_POSTSUBSCRIPT 6.917.9%subscript6.9percent17.96.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-17.9\%}}6.9 start_POSTSUBSCRIPT - 17.9 % end_POSTSUBSCRIPT 6.917.9%subscript6.9percent17.9\bm{6.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5% }\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-17.9\%}}}bold_6.9 start_POSTSUBSCRIPT bold_- bold_17.9 bold_% end_POSTSUBSCRIPT
crowd-party 22.622.622.622.6 21.45.3%subscript21.4percent5.321.4_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-5.3\%}}21.4 start_POSTSUBSCRIPT - 5.3 % end_POSTSUBSCRIPT 19.911.9%subscript19.9percent11.919.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-11.9\%}}19.9 start_POSTSUBSCRIPT - 11.9 % end_POSTSUBSCRIPT 20.111.1%subscript20.1percent11.120.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-11.1\%}}20.1 start_POSTSUBSCRIPT - 11.1 % end_POSTSUBSCRIPT 19.513.7%subscript19.5percent13.719.5_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-13.7\%}}19.5 start_POSTSUBSCRIPT - 13.7 % end_POSTSUBSCRIPT 19.215.0%subscript19.2percent15.0\bm{19.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 5.0\%}}}bold_19.2 start_POSTSUBSCRIPT bold_- bold_15.0 bold_% end_POSTSUBSCRIPT
avg. 13.013.013.013.0 11.98.5%subscript11.9percent8.511.9_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-8.5\%}}11.9 start_POSTSUBSCRIPT - 8.5 % end_POSTSUBSCRIPT 11.114.6%subscript11.1percent14.611.1_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-14.6\%}}11.1 start_POSTSUBSCRIPT - 14.6 % end_POSTSUBSCRIPT 11.213.8%subscript11.2percent13.811.2_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-13.8\%}}11.2 start_POSTSUBSCRIPT - 13.8 % end_POSTSUBSCRIPT 10.816.9%subscript10.8percent16.910.8_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{0,.5,.5}% \pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-16.9\%}}10.8 start_POSTSUBSCRIPT - 16.9 % end_POSTSUBSCRIPT 10.717.7%subscript10.7percent17.7\bm{10.7_{{\color[rgb]{0,.5,.5}\definecolor[named]{pgfstrokecolor}{rgb}{% 0,.5,.5}\pgfsys@color@rgb@stroke{0}{.5}{.5}\pgfsys@color@rgb@fill{0}{.5}{.5}-1% 7.7\%}}}bold_10.7 start_POSTSUBSCRIPT bold_- bold_17.7 bold_% end_POSTSUBSCRIPT
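
The percentages attached to each WER in the rows above appear to be relative WER changes with respect to the first numeric column. The following is a minimal Python sketch of that arithmetic, assuming the first column is the baseline WER and taking the last column as the corrected WER; the function name `relative_wer_change` is ours and purely illustrative.

```python
def relative_wer_change(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER change in percent; negative values mean fewer errors."""
    return 100.0 * (corrected_wer - baseline_wer) / baseline_wer


# (baseline WER, last-column WER) copied from the rows above.
rows = {
    "dev-real": (10.6, 4.1),
    "dev-simu": (12.4, 5.8),
    "baby-cry": (8.0, 6.0),
    "crowd-party": (22.6, 19.2),
}

for name, (baseline, corrected) in rows.items():
    change = relative_wer_change(baseline, corrected)
    print(f"{name}: {baseline} -> {corrected} ({change:+.1f}%)")
```

For dev-real, for instance, (4.1 - 10.6) / 10.6 is roughly -61.3%, which agrees with the value reported in that row.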

D.3 Ablation Study of Language Embedding Extractor

Table 13 presents an ablation study of the proposed language-space noise embedding with different text embedding extractors. First, we use the input word-embedding layer of LLaMA-2-7b to extract both the utterance- and token-level embeddings defined in §4.2. This leads to only minor gains over the audio denoising baseline, indicating that the LLaMA embedding is less discriminative for audio noise modeling. The supervised text classifier FastText (Grave et al., 2018) provides a better way to extract text embeddings for modeling the N-best list diversity. Benefiting from the powerful global context modeling ability of the Transformer (Vaswani et al., 2017), SBERT (Reimers & Gurevych, 2019) achieves the best performance for language-space noise embedding extraction, and it represents both utterance- and token-level embeddings well, as shown in Table 3.

D.4 Ablation Study of Audio Noise Distillation

Table 14 compares different KD approaches for audio noise distillation. The first is teacher-student learning, which performs distillation through KL-divergence regularization between a trainable student and a frozen teacher, but it yields only minor performance gains. In comparison, contrastive learning achieves better results by introducing positive and negative samples to learn distinctiveness. However, it is still sub-optimal due to the large distance between the language and audio spaces: the anchor (language noise embedding) is far away from both the positive (noisy audio embedding) and the negative (clean audio embedding) samples, which are relatively close to each other. To this end, the MINE approach we adopt introduces a neural network to estimate and maximize mutual information, which is more direct and effective in manipulating representations in different spaces for knowledge distillation. As a result, MINE achieves the best performance for audio noise distillation.
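
To make the idea of estimating mutual information with a learned function more concrete, the sketch below evaluates the Donsker-Varadhan lower bound, which is the quantity MINE is commonly described as maximizing. It is an illustration under strong assumptions, not the training objective used in this work: the critic is a hand-set function `scale * x * z` instead of a trained neural network, the toy data are hypothetical binary pairs rather than language and audio embeddings, and all names (`dv_lower_bound`, `make_critic`) are ours.

```python
import math
from itertools import product


def dv_lower_bound(critic, pairs):
    """Donsker-Varadhan lower bound on I(X; Z), in nats, for a fixed critic."""
    xs = [x for x, _ in pairs]
    zs = [z for _, z in pairs]
    # Expectation of the critic under the empirical joint distribution.
    joint_term = sum(critic(x, z) for x, z in pairs) / len(pairs)
    # Expectation of exp(critic) under the product of the empirical marginals,
    # obtained by pairing every x with every z.
    marginal_mean = sum(
        math.exp(critic(x, z)) for x, z in product(xs, zs)
    ) / (len(xs) * len(zs))
    return joint_term - math.log(marginal_mean)


def make_critic(scale):
    """Hypothetical one-parameter critic standing in for a neural network."""
    return lambda x, z: scale * x * z


# Toy data (hypothetical): z is an exact copy of a fair binary x,
# so the true mutual information is log(2) nats.
pairs = [(1, 1), (-1, -1)] * 4

for scale in (0.0, 1.0, 3.0):
    bound = dv_lower_bound(make_critic(scale), pairs)
    print(f"scale={scale}: bound={bound:.4f} (true MI={math.log(2):.4f})")
```

On this toy data the bound equals `scale - log(cosh(scale))`, so it is zero for an uninformative critic and approaches log(2) from below as the critic sharpens. MINE replaces the hand-set `scale` with network parameters updated by gradient ascent on the same bound.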

Table 15: N-best hypotheses from a speech sample under different noise conditions. We use two noise types (i.e., Babble and AC/Vacuum) and two SNR levels (i.e., 0 and 10 dB) from the LibriSpeech-FreeSound test set, where the original sample ID is ``237-134500-0040''. The ``Acoustic Score'' denotes the decoding score from the Whisper Large-V2 model, which is calculated as the negative entropy. Red font highlights the wrong tokens compared to the ground-truth transcription.
Noise Type | SNR (dB) | N-best Hypotheses | Acoustic Score | WER (%)
Babble | 0 | i pray for them but that is not the same as i pray for sam | -0.467 | 33.3
 | | i pray for them but that is not the same as i pray for science | -0.485 | 33.3
 | | i pray for them but that is not the same as if i prayed for sam | -0.516 | 26.7
 | | i pray for them but that is not the same as i pray for sons | -0.517 | 33.3
 | | i pray for them but that is not the same as if i pray for sam | -0.521 | 33.3
 | 10 | i pray for you but that is not the same as if you prayed yourself | -0.328 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.328 | 0.0
 | | i pray for you but that is not the same as if you pray yourself | -0.340 | 6.7
 | | i pray for you but that is not the same as if you pray for yourself | -0.426 | 13.3
 | | i pray for you but that is not the same as if you prayed for yourself | -0.449 | 6.7
AC | 0 | i pray for you but that is not the same as if you prayed yourself | -0.329 | 0.0
 | | i pray for you but that is not the same as if you pray yourself | -0.369 | 6.7
 | | i pray for you but that is not the same as if you pray for yourself | -0.388 | 13.3
 | | i would pray for you but that is not the same as if you prayed yourself | -0.428 | 6.7
 | | i pray for you but that is not the same as if you prayed for yourself | -0.429 | 6.7
 | 10 | i pray for you but that is not the same as if you prayed yourself | -0.305 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.305 | 0.0
 | | i prayed for you but that is not the same as if you prayed yourself | -0.343 | 6.7
 | | i prayed for you but that is not the same as if you prayed yourself | -0.343 | 6.7
 | | i prayed for you but that is not the same as if you prayed yourself | -0.343 | 6.7
Clean | ∞ | i pray for you but that is not the same as if you prayed yourself | -0.280 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.280 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.280 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.280 | 0.0
 | | i pray for you but that is not the same as if you prayed yourself | -0.280 | 0.0
Ground Truth | | i pray for you but that is not the same as if you prayed yourself | - | -
Figure 6: The t-SNE visualizations of language-space noise embeddings from source speech under different noise types and SNR levels. The average distances between embeddings of clean and various noisy conditions are: 58.6 (babble_0dB), 24.5 (babble_10dB), 22.6 (ac_0dB) and 14.3 (ac_10dB).

D.5 Relationship between Noisy Speech and N-best List Diversity

As introduced in §1, our motivation for using a language-space noise embedding to represent audio noise is the relationship between the noise conditions of the source speech and the diversity of the N-best list decoded by the ASR model: the worse the noise conditions (a more challenging noise type or a lower SNR), the higher the uncertainty of ASR beam search decoding, and thus the more diverse the N-best hypotheses. To verify this insight, Table 15 presents the N-best hypotheses of a speech sample under different noise conditions. For Babble noise, 0 dB yields higher decoding uncertainty (i.e., lower acoustic scores) than 10 dB, which results in more diverse N-best hypotheses and a worse 1-best WER, i.e., more language noise. A similar phenomenon can be observed under the AC noise condition. In addition, Table 11 shows that Babble noise yields worse ASR results than AC noise at the same SNR level, which means Babble is the more challenging noise type. As a result, Babble_0dB produces a more diverse N-best list than AC_0dB, and the same holds for Babble_10dB and AC_10dB. In particular, the highly intelligible clean speech yields no N-best diversity. Fig. 6 visualizes the language noise that originates from different audio noise, where the distances between clusters well represent the noise levels of the source speech.
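
The trend described above can be checked directly on the hypotheses listed in Table 15. The sketch below is an illustration only: it uses the number of distinct hypotheses as a crude proxy for N-best diversity (this is not the language-space noise embedding itself), computes the 1-best WER with a word-level edit distance, and covers just the Babble and clean conditions. The helper names `edit_distance`, `wer` and `distinct_count` are ours.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def distinct_count(hypotheses):
    """Crude diversity proxy: number of distinct hypotheses in the list."""
    return len(set(hypotheses))


reference = "i pray for you but that is not the same as if you prayed yourself"
stem_them = "i pray for them but that is not the same as "
stem_you = "i pray for you but that is not the same as if you "

# Hypotheses copied from Table 15 (Babble and clean conditions only).
nbest = {
    "babble_0dB": [
        stem_them + "i pray for sam",
        stem_them + "i pray for science",
        stem_them + "if i prayed for sam",
        stem_them + "i pray for sons",
        stem_them + "if i pray for sam",
    ],
    "babble_10dB": [
        stem_you + "prayed yourself",
        stem_you + "prayed yourself",
        stem_you + "pray yourself",
        stem_you + "pray for yourself",
        stem_you + "prayed for yourself",
    ],
    "clean": [reference] * 5,
}

for condition, hyps in nbest.items():
    print(condition, distinct_count(hyps), round(wer(reference, hyps[0]), 1))
```

Under this proxy, the list contains 5, 4 and 1 distinct hypotheses for Babble at 0 dB, Babble at 10 dB and clean speech respectively, while the 1-best WER drops from 33.3% to 0.0%. A count of distinct strings saturates at the list size, so a finer measure is needed to separate conditions that are both highly diverse.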

In summary, the relationship between the audio noise in source speech and the language noise in decoded N-best list inspires us to propose language-space denoising. Fortunately, the powerful generation ability of LLMs promotes the success of this research idea.

Appendix E Limitations

Though effective in improving noisy ASR performance, there still exist some limitations in the proposed RobustGER.

  • Table 16 presents a failure case on the CHiME-4 dev-real set. There is one error in the N-best hypotheses: the word ``Miss'', which should be ``Ms'' in the ground truth. The GER baseline successfully corrects this error while RobustGER fails. A possible reason is that the words ``Ms'' (/mIz/) and ``Miss'' (/mIs/) sound similar, especially under noisy scenarios, so GER cannot distinguish them and relies on the LLM to decide based on context. Thanks to their rich linguistic knowledge and powerful reasoning ability, LLMs enable GER to generate the correct word ``Ms'', which is more appropriate than ``Miss'' in this context. On the other hand, with the proposed language-space denoising, RobustGER perceives the subtle difference between their pronunciations but finds the word more likely to be ``Miss'' (e.g., the speaker’s pronunciation may be non-standard). Such information misleads the LLM into generating the wrong word. This is therefore a trade-off between contextual information and denoising when LLMs generate the transcription: 1) when both homophones suit the context, LLMs should denoise carefully to find the correct word (see Table 5); 2) when one homophone is clearly more suitable to the context than the other, LLMs may not need denoising, as it could provide misleading information. We believe this could be a promising research direction for future work on GER.

  • We observe from the main results in Table 1 that both GER and our RobustGER achieve significantly larger improvements on the CHiME-4 dataset than on other datasets. This phenomenon was also observed and analyzed in the original GER benchmark (Chen et al., 2023b): the transcriptions of CHiME-4 contain many financial terms that are relatively easy for LLMs to correct. Therefore, future work may need an analysis of error types on CHiME-4 to understand how RobustGER works there.

  • After our initial draft was released on OpenReview in September 2023, we learned of recent developments in post-recognition text modeling, as well as LLM-based efforts in audio understanding (Gong et al., 2023a; b; Wu et al., 2023b) and speaker diarization (Park et al., 2023; Wang et al., 2024). We hope to align the efforts of different research groups to enable more robust and resilient text modeling evaluations for various speech and audio processing tasks in the future, as part of a collaborative community effort.

Table 16: Failure case of RobustGER. The test sample is from CHiME-4 dev-real dataset with ID as ``M03_052C010R_BUS''.
Method | Utterance | WER (%)
N-best List | miss amsterdam declined to comment | 20.0
 | miss amsterdam declined to comment | 20.0
 | ms amsterdam declined to comment | 0.0
 | miss amsterdam declined to comment | 20.0
 | miss amsterdam decline to comment | 40.0
GER | ms amsterdam declined to comment | 0.0
RobustGER | miss amsterdam declined to comment | 20.0
Ground Truth | ms amsterdam declined to comment | -
Table 17: Distances between the language noise embeddings from clean and different noisy conditions. The corresponding t-SNE visualizations are presented in Fig. 4.

Clean vs. | ac | babble | cafe | car | metro | traffic | avg.
Language Noise Emb. | 59.7 | 54.9 | 32.4 | 12.7 | 19.1 | 17.4 | 32.7
+ Audio Distillation | 57.6 | 87.5 | 53.2 | 37.5 | 32.1 | 51.8 | 53.3