Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Abstract
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the ``HyPoradise'' dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is to introduce noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of the source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings into our language embedding. Experiments on various latest LLMs demonstrate that our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate, even with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of the source speech, under which off-the-shelf LLMs show a strong ability of language-space denoising. This work is open sourced at: https://github.com/YUCHEN005/RobustGER.
1 Introduction
Recent advances in large language models (LLMs) have attracted a surge of research interest due to their representation power of language generation (OpenAI, 2022; 2023; Touvron et al., 2023a), which achieve a wide range of success on natural language processing (NLP) tasks (Brown et al., 2020; Wei et al., 2022; Ouyang et al., 2022). Powered by LLMs, latest works (Chen et al., 2023b; Yang et al., 2023a) propose a generative error correction (GER) framework (https://github.com/Hypotheses-Paradise/Hypo2Trans) for automatic speech recognition (ASR), along with a ``HyPoradise'' dataset (https://huggingface.co/datasets/PeacefulData/Robust-HyPoradise) that contains abundant pairs of ASR N-best hypotheses and ground-truth transcription. It has shown great performance in learning the mapping from hypotheses to transcription by parameter-efficient LLM finetuning (Hu et al., 2021), which significantly outperforms typical LM rescoring methods (Mikolov et al., 2010). However, their study lacks specificity on noisy ASR scenarios, which are the most common in the real world (Li et al., 2015).
In this work, we extend the GER benchmark to noisy conditions and propose a Robust HyPoradise (RobustHP) dataset with 113K hypotheses-transcription pairs from various ASR corpora in common noisy scenarios. Similar to the original benchmark, we also observe error correction improvement from LLM finetuning on noisy ASR, but the performance gain in most noisy conditions is still limited (see Table 1). This indicates that LLM-based GER is still vulnerable to source audio noise (see our case study in Table 5). Fortunately, we can draw inspiration from the noise-robust ASR community. Their key idea is to map noisy speech features to the clean space (i.e., denoise) before recognition (Li et al., 2014), where speech enhancement denoising (Pandey et al., 2021) is one of the most popular approaches. Therefore, we raise a research question for our case: Can we teach LLMs to denoise the N-best hypotheses for GER, just as robust ASR and speech enhancement do?
![Refer to caption](x1.png)
Inspired by recent works on LLM adaptation (Wu et al., 2023a; Fathullah et al., 2023; Gao et al., 2023), a general solution here is to incorporate audio noise information as a conditioner into LLM finetuning to make it noise-aware, which is also similar to the popular conditional diffusion model (Dhariwal & Nichol, 2021). However, latest works find that directly introducing other modalities (e.g., audio, visual) into LLM finetuning could harm its stability and performance due to cross-modality gap (Zhang et al., 2023b; Li et al., 2023b). Our examination in Table 1 also indicates this limitation.
To this end, we propose to extract a noise embedding in language space to represent the noise conditions of the source speech, by measuring the diversity of the N-best hypotheses list from ASR decoding. The insight is that the worse the noisy conditions (more challenging noise type or lower SNR), the higher the uncertainty of ASR beam search decoding, which results in more diverse N-best hypotheses, as illustrated in Table 15 and Fig. 6. Extracted from the language space of hypotheses instead of the audio space, our noise embedding can be well incorporated into LLM tuning to improve GER, which can be viewed as a novel language-space denoising process. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation (Belghazi et al., 2018) to distill the real noise information in audio embeddings into our extracted language embedding. As a result, it presents stronger noise representativeness (see Fig. 4(b)) and enhances the denoising performance. Various latest LLMs (e.g., LLaMA-2 (Touvron et al., 2023b), LLaMA (Touvron et al., 2023a) and Falcon (Penedo et al., 2023)) are utilized to verify the effectiveness of our approach, and the comprehensive experimental results demonstrate that our model improves the GER performance with up to 53.9% word error rate (WER) reduction on RobustHP test sets, even with limited training data.
Our contribution can be summarized as follows:
- We extend the latest ASR generative error correction benchmark to noise-robust ASR, where a Robust HyPoradise (RobustHP) dataset with 113K hypotheses-transcription pairs is collected from various ASR corpus in common noisy conditions.
- We propose RobustGER, a noise-aware generative error correction approach based on LLMs to map N-best hypotheses to true transcription, where an extracted language-space noise embedding with audio distillation is utilized to teach LLMs to perform denoising.
- Experiments on various latest LLMs show the proposed approach achieves a new breakthrough on RobustHP with up to 53.9% GER improvement in terms of word error rate (WER). Analysis verifies the effectiveness of our proposed language-space embedding to represent audio noise, under which LLMs show strong ability of language-space denoising.
2 Related Work
Large Language Models and Parameter-efficient Finetuning. There is recently a surge of research interests in Transformer-based LLMs, such as ChatGPT (OpenAI, 2022), GPT-4 (OpenAI, 2023) and LLaMA (Touvron et al., 2023a). Benefiting from giant model size and abundant training data, LLMs can understand the linguistic structures and semantic meanings behind text, which shows remarkable performance on a wide range of NLP tasks (Brown et al., 2020; Wei et al., 2022; Ouyang et al., 2022). To adapt LLMs to downstream tasks, many recent works investigate parameter-efficient LLM finetuning (Hu et al., 2021) considering its huge model size. In order to further exploit the potential of LLMs on multimodal tasks, more recent works investigate to incorporate other modalities into LLM tuning (Wu et al., 2023a; Fathullah et al., 2023; Li et al., 2023a; Chen et al., 2023c; Zhang et al., 2023a; b; Gao et al., 2023; Wang et al., 2023; Radhakrishnan et al., 2023). However, the latest works find that directly introducing other modalities into LLMs could harm the finetuning stability and performance due to the heterogeneous cross-modality gap (Zhang et al., 2023b; Li et al., 2023b). Therefore, this work proposes to extract a language embedding from the N-best list to represent audio noise, which works well in teaching LLMs to perform denoising.
LM Rescoring and ASR Generative Error Correction. LM rescoring has been widely used in ASR decoding to improve the linguistic acceptability of recognition results, which achieves stable gains of ASR performance (Arisoy et al., 2015; Shin et al., 2019; Mikolov et al., 2010; Yang et al., 2021; Yu et al., 2023). Typically, an external LM is deployed to rescore the N-best hypotheses list from ASR beam search decoding and rerank it to select the 1-best candidate. Furthermore, to make full use of all candidates, recent works use the entire N-best list for error correction (Leng et al., 2021; Ma et al., 2023; Hu et al., 2020; 2023; Guo et al., 2019; Hu et al., 2022; Chen et al., 2023a), which outperforms rescoring methods. Powered by LLMs, the latest works propose a generative error correction (GER) benchmark (Chen et al., 2023b) to directly predict the ground-truth transcription from ASR N-best hypotheses. To enable the learning of the hypotheses-to-transcription mapping, they also propose a HyPoradise dataset with 316K hypotheses-transcription pairs. This work extends the GER benchmark to the most common noisy ASR scenarios with a new Robust HyPoradise dataset.
Noise-robust ASR. Neural ASR has achieved human-level performance but its noise-robustness in the real world remains a challenge (Krishna et al., 2019). Recent noise-robust ASR methods make some progress by mapping noisy speech features to the clean space (i.e., denoise) before recognition (Li et al., 2014). For instance, speech enhancement serves as a denoising front-end (Fu et al., 2019) to improve speech quality for ASR (Pandey et al., 2021), domain adversarial training aims to learn noise-invariant speech features (Prasad et al., 2021), and the recent ASR foundation model uses web-scale data and various preprocessing steps for denoising (Radford et al., 2023). Inspired by them, this work investigates teaching LLMs to denoise the N-best hypotheses in language space for GER.
3 Benchmark and Dataset
3.1 Generative Error Correction Benchmark
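We extend the original generative error correction benchmark (Chen et al., 2023b) to noise-robust ASR. Given an input noisy speech $X_n$, the pre-trained ASR model first transcribes it into $N$-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$ by beam search decoding. The goal of GER is to learn a hypotheses-to-transcription (H2T) mapping $\mathcal{M}_{H2T}$ that predicts the transcription $Y$ based on the $N$-best list $\mathcal{Y}_N$:

$$Y = \mathcal{M}_{H2T}(\mathcal{Y}_N) \tag{1}$$

Given the ground-truth transcription $Y^*$, we can finetune the LLM to learn $\mathcal{M}_{H2T}$ in an auto-regressive manner, where the cross-entropy loss $\mathcal{L}_{H2T}$ is formulated as:

$$\mathcal{L}_{H2T} = \sum_{t=1}^{T} -\log \mathcal{P}_{\theta}(y_t^* \mid y_{t-1}^*, \cdots, y_1^*, \mathcal{Y}_N) \tag{2}$$

where $y_t^*$ is the $t$-th token of $Y^*$ (which contains $T$ tokens), and $\theta$ denotes the learnable parameters in the LLM (i.e., adapter).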
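As a minimal sketch of the loss in Eq.(2), the snippet below sums the negative log-probabilities assigned to the ground-truth tokens. The per-token probabilities are hypothetical placeholders standing in for the LLM outputs conditioned on the N-best list and the previous ground-truth tokens; the function name is chosen for illustration and is not taken from the released code.

```python
import math

def h2t_cross_entropy(token_probs):
    """Auto-regressive cross-entropy for one transcription.

    token_probs[t] is assumed to be the probability the model assigns to the
    t-th ground-truth token, given the N-best list and all earlier
    ground-truth tokens (teacher forcing).
    """
    return sum(-math.log(p) for p in token_probs)

# Hypothetical probabilities for a 4-token ground-truth transcription.
probs = [0.9, 0.8, 0.5, 0.95]
loss = h2t_cross_entropy(probs)
print(f"H2T loss: {loss:.4f}")
```

The sum of token-wise negative log-probabilities equals the negative log-probability of the whole transcription, so lowering this loss raises the likelihood of the ground truth given the N-best list.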
3.2 Robust HyPoradise Dataset
Correspondingly, we develop a Robust HyPoradise dataset by collecting hypotheses-transcription (HT) pairs from common noisy ASR corpus, including CHiME-4 (Vincent et al., 2016), VoiceBank-DEMAND (Valentini-Botinhao et al., 2016), NOIZEUS (Hu & Loizou, 2006), LibriSpeech-FreeSound (Prasad et al., 2021) and RATS (Graff et al., 2014), with details provided in §A. We employ Whisper Large-V2 (Radford et al., 2023), the state-of-the-art ASR foundation model to transcribe the noisy speech into N-best hypotheses (N is set to 5). As a result, we collect 113K HT pairs in total from various noise domains, and the dataset statistics are presented in Table 6.
4 Method
![Refer to caption](x2.png)
In this section, we present our noise-aware generative error correction (RobustGER) approach. We first describe the overall framework (§4.1), and then we introduce the extraction of language-space noise embedding from N-best hypotheses (§4.2), followed by audio noise distillation (§4.3) at last.
4.1 Overall Framework
The left part of Fig. 2 presents the overall framework of RobustGER. First, the noisy speech $X_n$ is sent into a pre-trained ASR model to generate the N-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$. Following that, we propose to extract a language-space noise embedding $E_{LN}$ from the N-best list $\mathcal{Y}_N$ to represent the noise conditions of the source speech $X_n$. As depicted in the right part of Fig. 2, such a noise embedding measures the diversity of N-best hypotheses on both utterance and token levels, which perceives the noise information in the input speech.
Furthermore, to enhance its noise representation ability, we design a KD approach to distill the real noise information in the source speech $X_n$ to the extracted language-space noise embedding $E_{LN}$. Specifically, we employ the audio embedding from the ASR encoder, $E_{ASR}(X_n)$, for distillation.
Finally, we add an instruction onto the N-best hypotheses and send them into the LLM to predict the true transcription (i.e., GER), with the language embedding $E_{LN}$ incorporated for denoising. Specifically, we add a minus sign before the noise embedding to indicate ``denoise''. This negated embedding $-E_{LN}$ is then sent to teach the LLM to do language-space denoising. Therefore, Eq.(1) should be re-written as:

$$Y = \mathcal{M}_{H2T}(\mathcal{Y}_N; -E_{LN}) \tag{3}$$

Here $\mathcal{Y}_N$ is the N-best list and $\mathcal{M}_{H2T}$ denotes the H2T mapping by efficient LLM finetuning, where we follow the adapter tuning from previous works (Zhang et al., 2023b; Yang et al., 2023b). We also borrow their idea of input-level prompting to incorporate our language noise embedding into LLM tuning, and the details are presented in §B.1. As with Eq.(2), we follow the original GER benchmark for optimization.
4.2 Language-space Noise Embedding
As directly incorporating an audio-space noise embedding into LLM finetuning could harm its stability and performance (Zhang et al., 2023b; Gao et al., 2023), we propose an alternative: extracting a language-space noise embedding from the N-best hypotheses to represent the noise conditions of the source speech. The key idea is to perceive the audio noise from the diversity of N-best hypotheses, i.e., the worse the noisy conditions (more challenging noise type or lower SNR), the higher the uncertainty of ASR beam search decoding, which results in more diverse N-best hypotheses (see Table 15 and Fig. 6).
As illustrated in the right part of Fig. 2, we extract the noise embedding on both utterance and token levels to capture rich diversity information: 1) Utterance-level: examine the diversity inside the N-best list in terms of the entire utterance's semantic meaning, which indicates the effect of audio noise on the global semantics of hypotheses; 2) Token-level: examine the distribution of N-best hypotheses in terms of all the tokens inside, which is similar to edit distance and thus directly corresponds to the WER metric. These two embeddings are finally combined to form the resulting noise embedding, i.e., $E_{LN} = [E_{LN}^{utt}; E_{LN}^{tok}]$. Specifically, we employ sentence-BERT (SBERT) (Reimers & Gurevych, 2019) to obtain the embeddings from raw text, which contain rich language-space semantic information.
4.2.1 Utterance-level Noise Embedding
Given N-best hypotheses $\mathcal{Y}_N = \{Y_1, Y_2, \cdots, Y_N\}$, we first obtain their sentence embeddings by the SBERT encoder $E_{sbert}(\cdot)$ and then calculate their diversity as:

$$E_{LN}^{utt} = \mathrm{Concat}\Big[\big\{E_{sbert}(Y_i) - E_{sbert}(Y_j)\big\}_{i,j=1,\, i>j}^{N}\Big] \in \mathbb{R}^{\frac{N(N-1)}{2} \times D_{sbert}} \tag{4}$$

where $D_{sbert}$ denotes the embedding size of the SBERT extractor. In short, it concatenates all the sentence embedding differences $E_{sbert}(Y_i) - E_{sbert}(Y_j)$ where $i > j$, resulting in an utterance-level noise embedding $E_{LN}^{utt}$. The key idea is that $Y_i$ ranks lower than $Y_j$ in the N-best hypotheses list, and thus presents lower confidence and worse transcription quality, i.e., more language noise. Therefore, Eq.(4) serves as a measurement of the audio noise in language space. Noisier speech leads to larger ASR decoding uncertainty and thus more diverse N-best hypotheses, so that Eq.(4) captures a larger diversity embedding.
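The pairwise-difference construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_sentence_embedding` is a hypothetical, deterministic stand-in for the SBERT encoder (it carries no semantic meaning), and the function names and row ordering are our own choices rather than the released implementation.

```python
from itertools import combinations

def toy_sentence_embedding(sentence, dim=8):
    """Hypothetical stand-in for SBERT: character counts folded into `dim` buckets."""
    vec = [0.0] * dim
    for ch in sentence.lower():
        vec[ord(ch) % dim] += 1.0
    return vec

def utterance_level_noise_embedding(nbest, embed=toy_sentence_embedding):
    """Stack embed(Y_i) - embed(Y_j) for every pair with i > j.

    Returns N*(N-1)/2 difference vectors, one per hypothesis pair.
    """
    embs = [embed(h) for h in nbest]
    rows = []
    for j, i in combinations(range(len(nbest)), 2):  # guarantees j < i
        rows.append([a - b for a, b in zip(embs[i], embs[j])])
    return rows

# 5-best list taken from the case study in this paper.
prefix = "the four other utility company owners will also have to take "
nbest = [prefix + "write ups"] * 3 + [prefix + "ride outs"] * 2
rows = utterance_level_noise_embedding(nbest)
zero_rows = sum(all(v == 0.0 for v in r) for r in rows)
print(len(rows), "pairs,", zero_rows, "with identical hypotheses")
```

Identical hypotheses contribute all-zero rows, so a less diverse N-best list yields an embedding closer to zero, which matches the intuition that cleaner speech gives more consistent hypotheses.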
4.2.2 Token-level Noise Embedding
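Apart from the utterance-level embedding, we also propose to extract a token-level noise embedding that directly corresponds to the WER metric of the ASR task. As shown in the bottom-right part of Fig. 2, similar to the calculation of edit distance, we first force-align the N-best hypotheses to the same length with zero padding (i.e., ``Ø''). The aligned N-best hypotheses clearly illustrate the token differences between candidates, where each utterance contains $T$ tokens that come from the ASR vocabulary $\mathcal{V}$ plus the zero padding Ø:

$$Y_i^{ali} = [y_{i_1}^{ali}, y_{i_2}^{ali}, \cdots, y_{i_T}^{ali}], \quad y_{i_t}^{ali} \in \mathcal{V} \cup \{\varnothing\} \tag{5}$$

Inspired by edit distance, we design an ``edit embedding'' $E_{edit}$ to capture the token-level difference between two hypotheses, which directly corresponds to their gap in final WER performance. Then, similar to Eq.(4), we calculate the token-level noise embedding from the edit embeddings between different pairs of hypotheses in the N-best list, where each edit embedding sums up the token-wise embedding differences of an aligned pair:

$$\begin{aligned} E_{LN}^{tok} &= \mathrm{Concat}\Big[\big\{E_{edit}(Y_i^{ali}, Y_j^{ali})\big\}_{i,j=1,\, i>j}^{N}\Big], \\ E_{edit}(Y_i^{ali}, Y_j^{ali}) &= \sum_{t=1}^{T} \big[E_{sbert}(y_{i_t}^{ali}) - E_{sbert}(y_{j_t}^{ali})\big] \end{aligned} \tag{6}$$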
Note that we employ SBERT again to extract the token embedding, as it can produce informative embeddings for both utterances and tokens (Reimers & Gurevych, 2019).
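To make the token-level idea concrete, the following Python sketch aligns two hypotheses with an edit-distance backtrace, pads the gaps with ``Ø'', and accumulates token-embedding differences over the aligned positions. It is a simplified, pairwise reading of the procedure: the paper aligns all N hypotheses to one common length, `toy_token_embedding` is a hypothetical stand-in for SBERT token embeddings, and mapping Ø to a zero vector is our assumption.

```python
PAD = "Ø"
DIM = 8

def align_pair(a, b):
    """Levenshtein alignment of two token lists; gaps are filled with PAD."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(PAD); i -= 1
        else:
            out_a.append(PAD); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

def toy_token_embedding(token):
    """Hypothetical token embedding; PAD is assumed to map to the zero vector."""
    vec = [0.0] * DIM
    if token != PAD:
        for ch in token:
            vec[ord(ch) % DIM] += 1.0
    return vec

def edit_embedding(hyp_i, hyp_j, embed=toy_token_embedding):
    """Sum of token-embedding differences over the aligned positions."""
    ali_i, ali_j = align_pair(hyp_i.split(), hyp_j.split())
    total = [0.0] * DIM
    for ti, tj in zip(ali_i, ali_j):
        total = [s + x - y for s, x, y in zip(total, embed(ti), embed(tj))]
    return total

aligned = align_pair("a b c".split(), "a c".split())
same = edit_embedding("take write ups", "take write ups")
diff = edit_embedding("take ride outs", "take write ups")
print(aligned, same, diff)
```

Applying `edit_embedding` to every pair with i > j, in the same way as the utterance-level case, would give one row per hypothesis pair.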
4.3 Audio Noise Distillation
![Refer to caption](x3.png)
After extracting the language-space noise embedding from N-best hypotheses, we further propose an audio noise distillation approach via mutual information estimation to enhance its noise representation ability. Mutual information (MI) is a measure of dependence between random variables based on the Shannon entropy, which is equivalent to the Kullback-Leibler (KL) divergence between the joint distribution and the product of the marginal distributions of the random variables. Given two random variables $X$ and $Z$, their MI can be calculated by:

$$I(X; Z) = D_{KL}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z) \tag{7}$$

where $D_{KL}$ denotes KL-divergence. However, it is intractable to directly calculate MI based on Eq.(7), so we leverage an estimation method called mutual information neural estimation (MINE) from previous work (Belghazi et al., 2018). MINE employs a statistics network $\psi_\phi$ parameterized by $\phi \in \Phi$ to estimate a neural information measure:

$$I_{\Phi}(X; Z) = \sup_{\phi \in \Phi} \; \mathbb{E}_{\mathbb{P}_{XZ}}[\psi_\phi] - \log\big(\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{\psi_\phi}]\big) \tag{8}$$

In practice, we employ the extracted language-space noise embedding $E_{LN}$ and the noisy audio embedding $E_{ASR}(X_n)$ as the joint distribution, while using $E_{LN}$ and the clean audio embedding $E_{ASR}(X_c)$ as the marginal distribution, as the noise information only exists in noisy speech.
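For intuition, the quantity that MINE maximizes over the statistics network (the Donsker-Varadhan lower bound) can be evaluated on a batch as the mean network score on joint samples minus the log of the mean exponentiated score on marginal samples. The sketch below assumes the network outputs are already available as plain numbers; the scores are hypothetical, and in the setting above the joint scores would come from (language embedding, noisy audio embedding) pairs and the marginal scores from (language embedding, clean audio embedding) pairs.

```python
import math

def dv_lower_bound(scores_joint, scores_marginal):
    """Batch estimate of E_joint[psi] - log(E_marginal[exp(psi)])."""
    joint_term = sum(scores_joint) / len(scores_joint)
    marginal_term = math.log(
        sum(math.exp(s) for s in scores_marginal) / len(scores_marginal)
    )
    return joint_term - marginal_term

# Hypothetical statistics-network outputs for a small batch.
scores_joint = [2.0, 1.5, 2.5]      # pairs treated as joint samples
scores_marginal = [0.0, 0.0, 0.0]   # pairs treated as marginal samples
mi_estimate = dv_lower_bound(scores_joint, scores_marginal)
print(f"MI lower-bound estimate: {mi_estimate:.3f}")
```

A network that cannot tell the two kinds of pairs apart (for example, one that outputs a constant) yields an estimate of zero, whereas higher scores on joint pairs than on marginal pairs push the estimate up.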
Algorithm 1 describes how MINE is utilized for audio noise distillation, which includes two stages. First, the statistics network is trained to learn accurate MI estimation using both the positive and negative sample pairs introduced above. Second, a learnable tuner is introduced to modulate the language embedding to capture more real noise information, by maximizing the MI between it and the noisy audio embeddings. More details about the MINE-based audio noise distillation are in §B.2. In addition, the LLM adapter is also updated in second stage to learn H2T map** for GER.
5 Experiments
| Test Set | | Baseline | LM | GER | + Audio Denoising | RobustGER (ours) | Oracle | |
|---|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | | | | | | | |
| | test-simu | | | | | | | |
| | dev-real | | | | | | | |
| | dev-simu | | | | | | | |
| | avg. | | | | | | | |
| VB-DEMAND | baby-cry | | | | | | | |
| | helicopter | | | | | | | |
| | crowd-party | | | | | | | |
| | avg. | | | | | | | |
| NOIZEUS | babble | | | | | | | |
| | car | | | | | | | |
| | station | | | | | | | |
| | train | | | | | | | |
| | street | | | | | | | |
| | airport | | | | | | | |
| | exhibition | | | | | | | |
| | restaurant | | | | | | | |
| | avg. | | | | | | | |
| LS-FreeSound | metro | | | | | | | |
| | car | | | | | | | |
| | traffic | | | | | | | |
| | cafe | | | | | | | |
| | babble | | | | | | | |
| | ac/vacuum | | | | | | | |
| | avg. | | | | | | | |
| RATS | test | | | | | | | |
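
| Noise Type | SNR (dB) | Baseline | LM | GER | + Audio Denoising | RobustGER (ours) | Oracle | |
|---|---|---|---|---|---|---|---|---|
| Metro | 0 | | | | | | | |
| | 5 | | | | | | | |
| | 10 | | | | | | | |
| | 15 | | | | | | | |
| | 20 | | | | | | | |
| | avg. | | | | | | | |
| AC/Vacuum | 0 | | | | | | | |
| | 5 | | | | | | | |
| | 10 | | | | | | | |
| | 15 | | | | | | | |
| | 20 | | | | | | | |
| | avg. | | | | | | | |
| Clean | | | | | | | | |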

| Test Set | | Baseline | GER | + Audio Denoising | + Language Denoising (Utt.-level) | + Language Denoising (Tok.-level) | + Language Denoising (Both) |
|---|---|---|---|---|---|---|---|
| CHiME-4 | test-real | | | | | | |
| | test-simu | | | | | | |
| | dev-real | | | | | | |
| | dev-simu | | | | | | |
| | avg. | | | | | | |
| VB-DEMAND | baby-cry | | | | | | |
| | helicopter | | | | | | |
| | crowd-party | | | | | | |
| | avg. | | | | | | |
![Refer to caption](x4.png)
5.1 Setup
We conduct experiments on the proposed RobustHP dataset, which is detailed in §A. To verify the general effectiveness of our approach, we utilize various latest LLMs for evaluation, including LLaMA-2-7b/13b (Touvron et al., 2023b), LLaMA-7b (Touvron et al., 2023a) and Falcon-7b (Penedo et al., 2023). We follow the LLM-Adapter in previous work (Zhang et al., 2023b) for both LLM finetuning and noise embedding incorporation. Details of model and experiment setups are in §C.
We report experimental results in terms of word error rate (WER) and relative GER improvement. We also report two oracle WERs for reference: 1) N-best oracle $o_{nb}$: WER of the ``best candidate'' in the N-best list, and 2) compositional oracle $o_{cp}$: best achievable WER using all the tokens in the N-best hypotheses. They indicate the upper bounds of reranking and GER (using occurred tokens), respectively.
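As a small worked example of these metrics, the sketch below computes a per-utterance WER with a standard word-level edit distance and takes the N-best oracle as the minimum WER over the candidates. The function names are illustrative only; corpus-level WER would instead aggregate edit counts and reference lengths over all utterances, and the compositional oracle is not covered here.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Per-utterance word error rate."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def nbest_oracle_wer(reference, nbest):
    """WER of the best candidate in the N-best list."""
    return min(wer(reference, h) for h in nbest)

# Utterances taken from the case study in this paper.
prefix = "the four other utility company owners will also have to take "
ref = prefix + "write offs"
nbest = [prefix + "write ups"] * 3 + [prefix + "ride outs"] * 2

print(f"1-best WER: {wer(ref, nbest[0]):.3f}")
print(f"N-best oracle WER: {nbest_oracle_wer(ref, nbest):.3f}")
```

In this example no candidate contains ``offs'', so reranking alone cannot reach zero WER, whereas a generative corrector is free to output a word that is absent from the list.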
5.2 Performance of RobustGER
Table 1 presents the experiment results on LLaMA-2-7b, and more LLMs are evaluated in §D.1. First, we can observe minor gains of performance brought by typical LM rescoring over the Whisper ASR baseline. Compared to LM rescoring, GER achieves promising progress by leveraging LLMs to generate transcription, while its performance gains in most noisy conditions except CHiME-4 are still limited. Introducing audio denoising further improves the result but suffers from the cross-modality gap. In comparison, with the proposed language-space denoising approach, our RobustGER achieves significant gains of performance in various noise conditions, with up to 53.9% GER improvement in terms of WER metric, where some results even surpass the reranking upper-bound.
Table 2 reports the performance of RobustGER under different SNRs, where we can observe consistent WER improvements on various noise levels. In addition, RobustGER also shows great effectiveness on clean test data with 30.0% relative WER reduction, which verifies its excellent generality.
5.3 Ablation Study
Table 3 illustrates the ablation study on the extraction of language-space noise embedding, which includes both utterance- and token-level information as introduced in §4.2. We can observe that utterance-level embedding only yields minor improvements over vanilla GER, indicating that the global semantics diversity of N-best hypotheses is not fine-grained enough for error correction. On the other hand, token-level information plays a significant role in language-space denoising for GER, as it directly corresponds to the word error rate metric. Combining both performs the best by leveraging richer information to measure N-best list diversity.
In addition, we also conduct ablation studies on the language embedding extractor (i.e., SBERT vs. FastText (Grave et al., 2018) and LLaMA embedding) in §D.3, as well as the audio noise distillation techniques (i.e., MINE vs. contrastive learning and teacher-student learning) in §D.4. All of them verify the effectiveness of our specific designs in the RobustGER system.
5.4 Analysis
Visualizations of Noise Embeddings. Fig. 4 visualizes the language-space noise embedding to show its representativeness of audio noise. First, we can observe from Fig. 4(a) that our extracted language embedding from the N-best list can well represent some noise types (i.e., ``ac'', ``babble'', ``cafe''), while the others are intertwined with clean embeddings, indicating less optimal noise representations. For reference, the audio noise embeddings in Fig. 4(c) distinguish well between different conditions. Therefore, we design a KD approach to distill the real noise information in the audio embedding to our language embedding. Fig. 4(b) shows that it disentangles the embeddings from different noise conditions and improves their noise representativeness, which leads to better WER results as shown in Table 14.
Data Efficiency. As shown in Table 4, we further discuss the data efficiency of RobustGER using the CHiME-4 dataset, whose training set contains 9.6k HT pairs decoded from 17.5-hour speech data. As we gradually reduce the training data, we find that using around half of the data (i.e., 5k pairs) can still maintain the WER performance. When it decreases to 2k pairs, RobustGER is still comparable to GER. This experimental evidence verifies the data efficiency of RobustGER, which may originate from the nature of parameter-efficient LLM finetuning.
Case Study. Table 5 illustrates a case study to demonstrate the effectiveness of RobustGER. There are two errors in N-best hypotheses, i.e., ``write ups'' (in 1-best) and ``ride outs'', where the ground truth is ``write offs''. Both ChatGPT-based in-context learning and LLaMA-based GER fail to correct this error, because the words ``write ups'' and ``write offs'' sound quite similar under noisy scenarios. In comparison, our RobustGER can correct this error by language-space denoising, where our proposed noise-representative embedding teaches LLMs to remove the language noise in N-best hypotheses that is caused by audio noise. More importantly, the semantic meanings of ``write ups'' and ``write offs'' are opposite, which highlights the significance of successful error correction by our RobustGER.
| Test Set | Baseline | GER | RobustGER (1k) | RobustGER (2k) | RobustGER (5k) | RobustGER (8k) | RobustGER (9.6k, full) |
|---|---|---|---|---|---|---|---|
| Training Hours | - | | | | | | |
| test-real | | | | | | | |
| test-simu | | | | | | | |
| dev-real | | | | | | | |
| dev-simu | | | | | | | |
| avg. | | | | | | | |

| Method | Utterance | WER (%) |
|---|---|---|
| N-best List | the four other utility company owners will also have to take write ups | |
| | the four other utility company owners will also have to take write ups | |
| | the four other utility company owners will also have to take write ups | |
| | the four other utility company owners will also have to take ride outs | |
| | the four other utility company owners will also have to take ride outs | |
| In-context Learning | the four other utility company owners will also have to take write-ups | |
| GER | the four other utility company owners will also have to take write ups | |
| RobustGER | the four other utility company owners will also have to take write offs | |
| Ground Truth | the four other utility company owners will also have to take write offs | - |
6 Conclusion
In this paper, we first extend the latest ASR generative error correction benchmark to the most common noisy scenarios in the real world, with a proposed RobustHP dataset containing 113K hypotheses-transcription pairs decoded from various noisy ASR corpora. Based on that, we propose RobustGER, a noise-aware generative error correction approach based on LLMs to predict the ground-truth transcription from N-best hypotheses, where an extracted language-space noise embedding with audio distillation is leveraged to teach LLMs to perform denoising in language space. Extensive experiments on various latest LLMs show that our approach achieves a new breakthrough on the RobustHP dataset with up to 53.9% error correction improvement in terms of WER, even with limited training data. Further analysis verifies the effectiveness of our proposed language-space embedding to represent audio noise, under which off-the-shelf LLMs show a strong ability of language-space denoising.
References
- Arisoy et al. (2015) Ebru Arisoy, Abhinav Sethy, Bhuvana Ramabhadran, and Stanley Chen. Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5421–5425. IEEE, 2015.
- Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International conference on machine learning, pp. 531–540. PMLR, 2018.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2023a) Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Hexin Liu, Sabato Marco Siniscalchi, and Eng Siong Chng. Generative error correction for code-switching speech recognition using large language models. arXiv preprint arXiv:2310.13013, 2023a.
- Chen et al. (2023b) Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, and Ensiong Chng. Hyporadise: An open baseline for generative speech recognition with large language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b.
- Chen et al. (2023c) Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023c.
- Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Fathullah et al. (2023) Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, et al. Prompting large language models with speech recognition abilities. arXiv preprint arXiv:2307.11795, 2023.
- Feldman et al. (2023) Philip Feldman, James R Foulds, and Shimei Pan. Trapping llm hallucinations using tagged context prompts. arXiv preprint arXiv:2306.06085, 2023.
- Font et al. (2013) Frederic Font, Gerard Roma, and Xavier Serra. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp. 411–412, 2013.
- Fu et al. (2019) Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031–2041. PMLR, 2019.
- Gao et al. (2023) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- Gong et al. (2023a) Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-at: Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, 2023a.
- Gong et al. (2023b) Yuan Gong, Alexander Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint audio and speech understanding. In IEEE Proc. ASRU, 2023b.
- Graff et al. (2014) David Graff, Kevin Walker, Stephanie M Strassel, Xiaoyi Ma, Karen Jones, and Ann Sawyer. The rats collection: Supporting hlt research with degraded audio data. In LREC, pp. 1970–1977. Citeseer, 2014.
- Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
- Guo et al. (2019) Jinxi Guo, Tara N Sainath, and Ron J Weiss. A spelling correction model for end-to-end speech recognition. In Proc. ICASSP, pp. 5651–5655. IEEE, 2019.
- Hirsch & Pearce (2000) Hans-Günter Hirsch and David Pearce. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000-Automatic speech recognition: challenges for the new Millenium ISCA tutorial and research workshop (ITRW), 2000.
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hu et al. (2020) Ke Hu, Tara N Sainath, Ruoming Pang, and Rohit Prabhavalkar. Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7799–7803. IEEE, 2020.
- Hu et al. (2022) Ke Hu, Tara N Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, and Weiran Wang. Improving deliberation by text-only and semi-supervised training. arXiv preprint arXiv:2206.14716, 2022.
- Hu et al. (2023) Ke Hu, Bo Li, and Tara N Sainath. Scaling up deliberation for multilingual asr. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 771–776. IEEE, 2023.
- Hu & Loizou (2006) Yi Hu and Philipos C Loizou. Subjective comparison of speech enhancement algorithms. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pp. I–I. IEEE, 2006.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Krishna et al. (2019) Gautam Krishna, Co Tran, Jianguo Yu, and Ahmed H Tewfik. Speech recognition with no speech or with noisy speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1090–1094. IEEE, 2019.
- Leng et al. (2021) Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Tao Qin, Xiang-Yang Li, Edward Lin, et al. Fastcorrect 2: Fast error correction on multiple candidates for automatic speech recognition. arXiv preprint arXiv:2109.14420, 2021.
- Li et al. (2014) Jinyu Li, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, 2014.
- Li et al. (2015) Jinyu Li, Li Deng, Reinhold Haeb-Umbach, and Yifan Gong. Robust automatic speech recognition: a bridge to practical applications, chapter 1, pp. 1–20. Academic Press, 2015.
- Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Li et al. (2022) Yanxi Li, Xinghao Chen, Minjing Dong, Yehui Tang, Yunhe Wang, and Chang Xu. Spatial-channel token distillation for vision mlps. In International Conference on Machine Learning, pp. 12685–12695. PMLR, 2022.
- Li et al. (2023b) Yuang Li, Yu Wu, Jinyu Li, and Shujie Liu. Prompting large language models for zero-shot domain adaptation in speech recognition. arXiv preprint arXiv:2306.16007, 2023b.
- Lin et al. (2021) Hsin-Yi Lin, Huan-Hsin Tseng, Xugang Lu, and Yu Tsao. Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport. Advances in Neural Information Processing Systems, 34:19935–19946, 2021.
- Loshchilov & Hutter (2018) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
- Ma et al. (2023) Rao Ma, Mark JF Gales, Kate Knill, and Mengjie Qian. N-best t5: Robust asr error correction using multiple input hypotheses and constrained decoding space. arXiv preprint arXiv:2303.00456, 2023.
- Mikolov et al. (2010) Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, volume 2, pp. 1045–1048. Makuhari, 2010.
- OpenAI (2022) OpenAI. Introducing chatgpt. OpenAI Blog, 2022.
- OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. IEEE, 2015.
- Pandey et al. (2021) Ashutosh Pandey, Chunxi Liu, Yun Wang, and Yatharth Saraf. Dual application of speech enhancement for automatic speech recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223–228. IEEE, 2021.
- Park et al. (2023) Tae Jin Park, Kunal Dhawan, Nithin Koluguri, and Jagadeesh Balam. Enhancing speaker diarization with large language models: A contextual beam search approach. arXiv preprint arXiv:2309.05248, 2023.
- Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- Prasad et al. (2021) Archiki Prasad, Preethi Jyothi, and Rajbabu Velmurugan. An investigation of end-to-end models for robust speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6893–6897. IEEE, 2021.
- Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.
- Radhakrishnan et al. (2023) Srijith Radhakrishnan, Chao-Han Yang, Sumeer Khan, Rohit Kumar, Narsis Kiani, David Gomez-Cabrero, and Jesper Tegnér. Whispering llama: A cross-modal generative error correction framework for speech recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10007–10016, 2023.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- Shin et al. (2019) Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. Effective sentence scoring method using bert for speech recognition. In Asian Conference on Machine Learning, pp. 1081–1093. PMLR, 2019.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Valentini-Botinhao et al. (2016) Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi. Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152, 2016.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Veaux et al. (2013) Christophe Veaux, Junichi Yamagishi, and Simon King. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 O-COCOSDA/CASLRE, pp. 1–4, 2013.
- Vincent et al. (2016) Emmanuel Vincent, Shinji Watanabe, Jon Barker, and Ricard Marxer. The 4th chime speech separation and recognition challenge. URL: http://spandh.dcs.shef.ac.uk/chime_challenge (last accessed 1 August 2018), 2016.
- Wang et al. (2024) Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, and Hank Liao. Diarizationlm: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506, 2024.
- Wang et al. (2023) Siyin Wang, Chao-Han Huck Yang, Ji Wu, and Chao Zhang. Can whisper perform speech-based in-context learning. arXiv preprint arXiv:2309.07081, 2023.
- Watanabe et al. (2018) Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
- Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- Wu et al. (2023a) Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, et al. On decoder-only architecture for speech-to-text and large language model integration. arXiv preprint arXiv:2307.03917, 2023a.
- Wu et al. (2023b) Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe. Improving audio captioning models with fine-grained audio features, text embedding supervision, and llm mix-up augmentation. arXiv preprint arXiv:2309.17352, 2023b.
- Yang et al. (2021) Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, and Ivan Bulyko. Multi-task language modeling for improving speech recognition of rare words. In Proc. IEEE ASRU, pp. 1087–1093. IEEE, 2021.
- Yang et al. (2023a) Chao-Han Huck Yang, Yile Gu, Yi-Chieh Liu, Shalini Ghosh, Ivan Bulyko, and Andreas Stolcke. Generative speech recognition error correction with large language models and task-activating prompting. In Proc. IEEE ASRU, 2023a.
- Yang et al. (2023b) Chao-Han Huck Yang, Bo Li, Yu Zhang, Nanxin Chen, Rohit Prabhavalkar, Tara N Sainath, and Trevor Strohman. From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition. In Proc. ICASSP, pp. 1–5. IEEE, 2023b.
- Yu et al. (2023) Yu Yu, Chao-Han Huck Yang, Jari Kolehmainen, Prashanth G Shivakumar, Yile Gu, Sungho Ryu, Roger Ren, Qi Luo, Aditya Gourav, I-Fan Chen, et al. Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In IEEE Proc. ASRU, 2023.
- Zhang et al. (2023a) Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- Zhang et al. (2023b) Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- Zhao et al. (2021) Long Zhao, Yuxiao Wang, Jiaping Zhao, Liangzhe Yuan, Jennifer J Sun, Florian Schroff, Hartwig Adam, Xi Peng, Dimitris Metaxas, and Ting Liu. Learning view-disentangled human pose representation by contrastive cross-view mutual information maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12793–12802, 2021.
- Zhu et al. (2021) Hao Zhu, Huaibo Huang, Yi Li, Aihua Zheng, and Ran He. Arbitrary talking face generation via attentional audio-visual coherence learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 2362–2368, 2021.
Appendix
Appendix A Robust HyPoradise Dataset Details
Domain: Source | Domain: Category | Training Set | # Pairs | Length | Test Set | # Pairs | Length |
---|---|---|---|---|---|---|---|
CHiME-4 | Real-world noise | tr05-real | 9,600 | 17.0 | test-real | 1,320 | 16.4 |
 | | | | | test-simu | 1,320 | 16.4 |
 | | | | | dev-real | 1,640 | 16.8 |
 | | | | | dev-simu | 1,640 | 16.8 |
VB-DEMAND | Unseen noise | train | 23,075 | 7.5 | baby-cry | 824 | 7.7 |
 | | | | | helicopter | 824 | 7.7 |
 | | | | | crowd-party | 824 | 7.7 |
NOIZEUS | Real-world noise | train | 23,807 | 7.1 | babble | 30 | 8.1 |
 | | | | | car | 30 | 8.1 |
 | | | | | station | 30 | 8.1 |
 | | | | | train | 30 | 8.1 |
 | | | | | street | 30 | 8.1 |
 | | | | | airport | 30 | 8.1 |
 | | | | | exhibition | 30 | 8.1 |
 | | | | | restaurant | 30 | 8.1 |
LS-FreeSound | Real-world noise | train | 28,539 | 35.0 | metro | 118 | 17.4 |
 | | | | | car | 118 | 17.4 |
 | | | | | traffic | 118 | 17.4 |
 | | | | | cafe | 118 | 17.4 |
 | | | | | babble | 118 | 17.4 |
 | | | | | ac/vacuum | 118 | 17.4 |
RATS | Radio noise | train | 28,504 | 14.2 | test | 1,000 | 10.2 |
Total | | train | 113,525 | 16.8 | test | 10,340 | 13.7 |
A.1 ASR system
For ASR beam search decoding, we employ Whisper Large-V2 (Radford et al., 2023), a large-scale pre-trained model developed by OpenAI, to generate N-best hypotheses. It has been reported to achieve competitive and state-of-the-art performance in several settings. The Whisper model follows the encoder-decoder Transformer (Vaswani et al., 2017) architecture with 1,550 million parameters and is trained on 680K hours of multilingual and multitask supervised data collected from the web. As a result, it shows strong noise robustness across various conditions, though it lacks domain specificity (i.e., it still lags behind models trained specifically on a given dataset).
With this pre-trained ASR model, we employ the beam search algorithm for decoding and generate an N-best hypotheses list for each speech sample, where the beam size is set to 50. After removing repeated utterances, we select the top-5 hypotheses in terms of posterior probability as the N-best list. To develop the RobustHP dataset, we carry out this decoding strategy on multiple noisy ASR corpora (see §A.2) and generate data pairs of 5-best hypotheses and ground-truth transcription.
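The selection step described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the function name `select_n_best`, the `(text, log_posterior)` input format, and the whitespace-only normalization used to detect repeats are all assumptions.

```python
from typing import Dict, List, Tuple


def select_n_best(beam: List[Tuple[str, float]], n: int = 5) -> List[str]:
    """Deduplicate beam-search outputs and keep the n most probable ones.

    `beam` holds (hypothesis_text, log_posterior) pairs, e.g. 50 entries
    for a beam size of 50 (hypothetical input format).
    """
    best_score: Dict[str, float] = {}
    for text, logp in beam:
        key = text.strip()
        # Keep the highest score seen for each distinct utterance.
        if key not in best_score or logp > best_score[key]:
            best_score[key] = logp
    ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
    return [text for text, _ in ranked[:n]]


# Toy beam with one repeated utterance.
beam = [("the cat sat", -0.9), ("the cat sat", -1.4),
        ("the cat sad", -1.1), ("a cat sat", -2.0)]
n_best = select_n_best(beam, n=3)
```

Each resulting list would then be paired with its ground-truth transcription to form one training example.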
A.2 Speech Corpus Selection
For speech corpus selection, our goal is to cover common noisy ASR scenarios in the real world. Consequently, we collect and simulate the following corpora with evident domain characteristics to compose the Robust HyPoradise dataset:
CHiME-4 (Vincent et al., 2016): CHiME-4 is a popular dataset for far-field noisy speech recognition. It includes real and simulated noisy recordings in four noisy environments, i.e., bus, cafe, pedestrian area, and street junction. We use its tr05-real split (9,600 utterances) to generate the RobustHP training data, and the test-real (1,320 utterances), test-simu (1,320 utterances), dev-real (1,640 utterances) and dev-simu (1,640 utterances) splits to generate the test data.
VoiceBank-DEMAND (Valentini-Botinhao et al., 2016): VoiceBank-DEMAND is a popular dataset for noise-robust speech recognition and speech enhancement. We use its training data for RobustHP generation, which contains 23,075 noisy utterances from 56 speakers in the VoiceBank corpus (Veaux et al., 2013), recorded at a sampling rate of 16 kHz and mixed with 10 different noise types (babble, cafeteria, car, kitchen, meeting, metro, restaurant, speech-shaped noise, station, traffic) at SNR levels of 0, 5, 10, and 15 dB. For the test set, to simulate challenging unseen noise conditions in practice, we mix the VoiceBank clean test data with three new types of noise (Lin et al., 2021), i.e., baby-cry, helicopter, and crowd-party, at an SNR level of 0 dB. The test set contains 824 utterances from 2 speakers.
NOIZEUS (Hu & Loizou, 2006): NOIZEUS is a noisy speech corpus developed to evaluate noise-robust speech recognition and speech enhancement algorithms. It contains only a test set of 30 IEEE sentences (produced by 3 male and 3 female speakers) corrupted by 8 different real-world noises at SNR levels of 0, 5, 10, and 15 dB, from which we select 5 dB for the main experiments. The noise was taken from the AURORA-2 database (Hirsch & Pearce, 2000) and includes suburban train, babble, car, exhibition hall, restaurant, street, airport and train-station noise. To match the short length of NOIZEUS test utterances (8.1 tokens on average), we select clean speech from the LibriSpeech train-clean-100 split and the VoiceBank corpus whose transcriptions contain no more than 12 tokens, and mix it with AURORA-2 noises at SNR levels of 0, 5, 10, 15, and 20 dB to form the training set.
LibriSpeech-FreeSound (Prasad et al., 2021): LibriSpeech-FreeSound is a simulated noisy speech corpus for noise-robust speech recognition, which mixes clean speech from the LibriSpeech train-clean-100 split (Panayotov et al., 2015) with noise from the FreeSound corpus (Font et al., 2013) at SNRs of 0, 5, 10, 15, 20, and 25 dB to form the training set. For the test set, they select 118 clean speech samples from the LibriSpeech test-clean split and mix them with FreeSound noise at SNRs of 0, 5, 10, 15, and 20 dB, from which we select 0 dB for the main experiments. Six noise types from FreeSound are employed: metro, car, traffic, cafe, babble and ac/vacuum.
RATS (Graff et al., 2014): The Robust Automatic Transcription of Speech (RATS) dataset contains radio-communication speech in the ultra high frequency data category, which is extremely noisy and challenging for the ASR task. Its training data contains 43,112 noisy speech utterances, from which we filter out low-quality samples (i.e., those whose WER by Whisper is larger than 0.9) to form the training set. Its test set contains 7,591 utterances, from which we randomly select 1,000 samples for higher evaluation efficiency.
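Several of the corpora above are simulated by mixing clean speech with noise at a target SNR. A minimal sketch of such mixing is given below, assuming single-channel waveforms stored as NumPy arrays; the function name `mix_at_snr` and the choice to loop or crop the noise to the utterance length are illustrative assumptions, not the procedure used by the original corpus creators.

```python
import numpy as np


def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean` so that the speech-to-noise power ratio is snr_db."""
    # Loop or crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise


rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of placeholder "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter placeholder noise clip
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

Lower `snr_db` values give a larger noise gain, so 0 dB corresponds to noise that is as powerful as the speech itself.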
A.3 Statistics
After performing beam search decoding on the selected speech corpora introduced above, we collect 113K pairs of N-best hypotheses and ground-truth transcription to form the RobustHP dataset. The statistics are presented in Table 6, which lists the number of hypotheses-transcription pairs and the average utterance length in various domains and splits. We will release the RobustHP dataset to the public upon publication and open the development venue for more data.
Appendix B Method Details
B.1 Denoised LLM Finetuning
B.1.1 Efficient LLM Finetuning: LLaMA-Adapter
![Refer to caption](x5.png)
As presented in Fig. 5, we employ LLaMA-Adapter (Zhang et al., 2023b) for efficient LLM finetuning. Given a pre-trained LLM with an $H$-layer Transformer, it inserts a set of learnable adaptation prompts into the top-$L$ layers that learn high-level semantics. Denote the prompt for the $l$-th Transformer layer as $\mathcal{G}_l \in \mathbb{R}^{U \times D}$, where $U$ denotes the prompt length and $D$ denotes the LLM embedding size.
Assume we have $M$ tokens containing the instruction and the already generated response, i.e., $T_l \in \mathbb{R}^{M \times D}$, where $l$ is the layer index, and we now aim to predict the $(M+1)$-th token as part of the response. In order to finetune the entire system, the learnable adaptation prompt is concatenated with $T_l$ as a prefix, i.e., $[\mathcal{G}_l; T_l] \in \mathbb{R}^{(U+M) \times D}$. In this case, the instruction knowledge learned by $\mathcal{G}_l$ can guide $T_l$ to generate the subsequent response under teacher-forcing supervision.
Furthermore, considering that the prompt $\mathcal{G}_l$ is randomly initialized and thus may disturb the LLM tuning at early training stages, a zero-initialized attention mechanism is designed to mitigate such disturbance. Suppose the LLM is going to generate the $(M+1)$-th token based on the prompt and history tokens $[\mathcal{G}_l; T_l]$ at the $l$-th layer, and we denote the current $M$-th token as $T_l^{(M)} \in \mathbb{R}^{1 \times D}$. In the attention mechanism, three projection layers first generate the query, key and value, respectively:
$$Q_l = \mathrm{Linear}_q\big(T_l^{(M)}\big), \quad K_l = \mathrm{Linear}_k\big([\mathcal{G}_l; T_l]\big), \quad V_l = \mathrm{Linear}_v\big([\mathcal{G}_l; T_l]\big). \tag{9}$$
Thereafter, the attention score between query and key can be formulated as $A_l = Q_l K_l^\top / \sqrt{D} \in \mathbb{R}^{1 \times (U+M)}$, which captures the correlation between the current token and all existing tokens as well as the prompt to predict the next token. Therefore, $A_l$ can be split into two parts:
$$A_l = \big[A_l^{\mathcal{G}}; A_l^{T}\big]^\top, \tag{10}$$
where $A_l^{\mathcal{G}} \in \mathbb{R}^{U \times 1}$ denotes the attention score of the adaptation prompts and $A_l^{T} \in \mathbb{R}^{M \times 1}$ denotes that of the history tokens. Since the adaptation prompts are randomly initialized, their attention scores may disturb next-token prediction in early training stages. To this end, a learnable gating factor $g_l$ with zero initialization is introduced to adaptively control the importance of the prompt in attention, by directly multiplying it with the softmax weights of the prompt part from Eq.(10):
$$A_l^{g} = \big[g_l \cdot \mathrm{softmax}(A_l^{\mathcal{G}}); \ \mathrm{softmax}(A_l^{T})\big]^\top. \tag{11}$$
Finally, the attention output of the $l$-th Transformer layer can be calculated with a linear projection:
$$O_l = \mathrm{Linear}_o\big(A_l^{g} \cdot V_l\big) \in \mathbb{R}^{1 \times D}. \tag{12}$$
It is then utilized to predict the next token as part of the output response. The zero-initialization mechanism achieves an effective trade-off between the pre-trained knowledge of the LLM and the instructional knowledge learned through the adaptation prompt.
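The gated attention of Eq.(9-12) can be illustrated with a small single-head NumPy sketch. It is a simplification under stated assumptions: the projections are plain weight matrices without bias, there is one head and one scalar gate, and the function and argument names are hypothetical rather than taken from the LLaMA-Adapter code.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def gated_prompt_attention(prompt, tokens, w_q, w_k, w_v, w_o, gate):
    """Single-head attention output for the current (last) token of one layer.

    prompt: (U, D) adaptation prompt; tokens: (M, D) history tokens;
    gate: scalar gating factor, zero at initialization.
    """
    d = tokens.shape[1]
    u = prompt.shape[0]
    prefix = np.concatenate([prompt, tokens], axis=0)   # (U + M, D)
    q = tokens[-1:] @ w_q                               # Eq.(9): query, (1, D)
    k = prefix @ w_k                                    # Eq.(9): keys
    v = prefix @ w_v                                    # Eq.(9): values
    scores = (q @ k.T / np.sqrt(d))[0]                  # (U + M,)
    s_prompt, s_tokens = scores[:u], scores[u:]         # Eq.(10): split
    # Eq.(11): separate softmax per part; the prompt part is scaled by the gate.
    weights = np.concatenate([gate * softmax(s_prompt), softmax(s_tokens)])
    return (weights[None, :] @ v) @ w_o                 # Eq.(12): (1, D)


rng = np.random.default_rng(0)
U, M, D = 20, 6, 8
prompt = rng.standard_normal((U, D))
tokens = rng.standard_normal((M, D))
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) for _ in range(4))
out_zero = gated_prompt_attention(prompt, tokens, w_q, w_k, w_v, w_o, gate=0.0)
out_other = gated_prompt_attention(2 * prompt, tokens, w_q, w_k, w_v, w_o, gate=0.0)
```

With `gate=0.0` the output does not depend on the prompt at all, so `out_zero` and `out_other` coincide; this is the sense in which a randomly initialized prompt cannot disturb the pre-trained model at the start of training.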
B.1.2 Denoised Adapter Tuning
Apart from text instructions, LLaMA-Adapter is also capable of generating responses based on inputs from other modalities (Zhang et al., 2023b). However, the cross-modal gap between text and other modalities may affect finetuning stability and performance (Li et al., 2023b). Therefore, we propose in §4.2 to extract a language-space noise embedding to replace the audio embedding for representing the noise conditions of source speech, i.e., $E_{\text{LN}} \in \mathbb{R}^{N(N-1) \times D_{\text{sbert}}}$ according to Eq.(9-12), where $N$ denotes the N-best list size and $D_{\text{sbert}}$ denotes the SBERT embedding size. Then, we incorporate it into LLaMA-Adapter for denoising via element-wise subtraction:
$$\mathcal{G}_l^{dn} = \mathcal{G}_l - g^{dn} \cdot \mathcal{T}_\omega(E_{\text{LN}}), \tag{13}$$
where $\mathcal{G}_l$ is the adaptation prompt of the $l$-th layer, $\mathcal{T}_\omega$ denotes the linear projection tuner introduced in §4.3 for audio noise distillation, and the subtraction operation denotes ``denoise''. The factor $g^{dn}$ is a gating factor that controls the denoising process. The resulting $\mathcal{G}_l^{dn}$ is therefore the adaptation prompt with language-space denoising, which replaces $\mathcal{G}_l$ in Eq.(9-12) for adapter tuning.
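A minimal NumPy sketch of the subtraction in Eq.(13) follows. The names are hypothetical, the tuner is modeled as a single affine map, and the gate is a scalar; the sizes use the SBERT dimension of 384, a 7b-scale embedding size of 4,096, and a prompt length of 20 taken as $N \times (N-1)$ with $N = 5$, which is our reading of the setup rather than a confirmed detail.

```python
import numpy as np


def denoise_prompt(prompt, noise_emb, w_tuner, b_tuner, gate_dn):
    """Subtract the projected noise embedding from the adaptation prompt.

    prompt:    (U, D) adaptation prompt of one layer
    noise_emb: (U, D_sbert) language-space noise embedding
    w_tuner, b_tuner: linear tuner mapping D_sbert -> D (assumed affine)
    gate_dn:   scalar gating factor controlling the strength of denoising
    """
    projected = noise_emb @ w_tuner + b_tuner    # (U, D)
    return prompt - gate_dn * projected          # element-wise subtraction


N, D_SBERT, D = 5, 384, 4096
U = N * (N - 1)                                  # 20, assumed to match the prompt length
rng = np.random.default_rng(0)
prompt = rng.standard_normal((U, D))
noise_emb = rng.standard_normal((U, D_SBERT))
w_tuner = rng.standard_normal((D_SBERT, D)) * 0.01
b_tuner = np.zeros(D)
denoised = denoise_prompt(prompt, noise_emb, w_tuner, b_tuner, gate_dn=0.1)
unchanged = denoise_prompt(prompt, noise_emb, w_tuner, b_tuner, gate_dn=0.0)
```

Setting the gate to zero recovers the original prompt, so the gate interpolates between plain adapter tuning and full subtraction of the projected noise embedding.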
B.2 Audio Noise Distillation
As illustrated in §4.3, the key idea of audio noise distillation is to transfer the real noise information in audio embeddings to our extracted language-space noise embedding, in order to enhance its ability to represent audio noise. The approach we propose is based on mutual information neural estimation (MINE) (Belghazi et al., 2018) and can be split into two stages, as shown in Algorithm 1. First, we update the MINE to learn MI estimation by maximizing the MI between the language-space noise embedding and noisy audio embeddings and minimizing the MI between the language embedding and clean audio embeddings, since audio noise information exists in noisy speech rather than clean speech. Second, we introduce a learnable tuner to modulate the language-space embedding to include more real noise information by maximizing the MI between it and the noisy audio embeddings; this tuner is jointly optimized with LLM finetuning (i.e., the GER cost function formulated in Eq.(2)).
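For reference, the estimator introduced by Belghazi et al. (2018) trains a statistics network $T_\theta$ to tighten the Donsker-Varadhan lower bound on mutual information, shown below. Here $X$ and $Z$ are generic random variables (in this setting they would stand for the language-space noise embedding and an audio embedding, which is our reading). The statistics network described in §C.1 ends with a Sigmoid, so Algorithm 1 may well use a different variant of this objective; the formula is given only as background.

```latex
I(X; Z) \;\ge\; \sup_{\theta} \;
  \mathbb{E}_{P_{XZ}}\big[ T_\theta(x, z) \big]
  \;-\; \log \mathbb{E}_{P_X \otimes P_Z}\big[ e^{T_\theta(x, z)} \big]
```

The first expectation is taken over paired samples and the second over independently drawn (shuffled) pairs, so the bound grows when the network can tell true pairs from mismatched ones.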
We leverage MINE for distillation instead of other techniques such as contrastive learning because of its strong distinguishing ability, which has been verified by recent applications (Zhu et al., 2021; Zhao et al., 2021; Li et al., 2022). In contrast, directly employing techniques such as contrastive learning may not work, as the language embedding could be far away from the audio-space noisy and clean embeddings: the distance between positive and negative samples (i.e., within the audio space) is much smaller than the distance between them and the anchor (i.e., between the audio and language spaces). Our ablation study in Table 14 also verifies this limitation.
Appendix C Experimental Setup Details
C.1 Model Setups
LLM | LLaMA-2-7b | LLaMA-7b | Falcon-7b | LLaMA-2-13b |
---|---|---|---|---|
Number of Transformer Layers | 32 | 32 | 32 | 40 |
Number of Attention Heads | 32 | 32 | 71 | 40 |
Embedding Size | 4,096 | 4,096 | 4,544 | 5,120 |
Block Size | 4,096 | 2,048 | 2,048 | 4,096 |
Vocabulary Size | 32,000 | 32,000 | 65,024 | 32,000 |
LLMs. We select three recent and popular LLMs for evaluation: LLaMA-2-7b (https://huggingface.co/meta-llama/Llama-2-7b-hf; Touvron et al., 2023b), LLaMA-7b (https://huggingface.co/yahma/llama-7b-hf; Touvron et al., 2023a), and Falcon-7b (https://huggingface.co/tiiuae/falcon-7b; Penedo et al., 2023). In addition, to explore the influence of LLM size on our approach, we also report some results on the LLaMA-2-13b model (https://huggingface.co/meta-llama/Llama-2-13b-hf; Touvron et al., 2023b). Table 7 compares their main configurations.
Adapter. We follow the default setting of LLaMA-Adapter (Zhang et al., 2023b) (https://github.com/Lightning-AI/lit-llama/blob/main/lit_llama/adapter.py, https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/adapter.py) with some modifications. The number of tunable Transformer layers $L$ is set to $H-1$, where $H$ is the total number of Transformer layers, which means all layers except the first one are tunable with inserted prompts. The prompt length $U$ is set to 20 to match the length of the language-space noise embedding $E_{\text{LN}}$, which equals $N \times (N-1)$, where $N$ is the N-best list size set to 5. To extract the language-space noise embedding from N-best hypotheses, we utilize sentence-BERT (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2; Reimers & Gurevych, 2019), whose embedding size is 384.
MINE. MINE introduces a statistics network that contains a multi-layer perceptron (MLP) and a Sigmoid activation function to estimate a mutual information value between 0 and 1. It receives two inputs, the Whisper-encoded audio embeddings of size 1280 and the language-space noise embedding of size 384, which are first projected to the same hidden dimension and added together, and then passed through the MLP to generate an output of size 1. In particular, to incorporate the modulated noise embedding (which has the same size as the LLM embedding, unlike the input language embedding of size 384) into MINE, we design an extra interface that receives it as intermediate features on the language-space feature branch. The noise embedding tuner contains a linear projection from the SBERT size of 384 to the LLM embedding size, as described in §B.1.2.
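The forward pass of such a statistics network can be sketched in NumPy as below. Only the input sizes (1280 and 384), the project-and-add fusion, the MLP, and the Sigmoid output come from the description above; the hidden size of 256, the single hidden layer, the ReLU activation, the absence of biases, and the class name are assumptions made for illustration. The extra interface for the modulated embedding is not modeled.

```python
import numpy as np


class StatisticsNetwork:
    """Scores an (audio embedding, language embedding) pair with a value in (0, 1)."""

    def __init__(self, audio_dim=1280, lang_dim=384, hidden=256, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            # Scaled Gaussian initialization keeps activations in a moderate range.
            return rng.standard_normal((rows, cols)) / np.sqrt(rows)

        self.w_audio = init(audio_dim, hidden)   # audio branch projection
        self.w_lang = init(lang_dim, hidden)     # language branch projection
        self.w1 = init(hidden, hidden)           # MLP hidden layer
        self.w2 = init(hidden, 1)                # MLP output layer

    def __call__(self, audio_emb, lang_emb):
        # Project both inputs to the same hidden size and add them together.
        h = audio_emb @ self.w_audio + lang_emb @ self.w_lang
        h = np.maximum(h @ self.w1, 0.0)         # ReLU (an assumption)
        logit = h @ self.w2
        return 1.0 / (1.0 + np.exp(-logit))      # Sigmoid, shape (batch, 1)


net = StatisticsNetwork()
rng = np.random.default_rng(1)
score = net(rng.standard_normal((4, 1280)), rng.standard_normal((4, 384)))
```

In training, scores for matched pairs would be pushed up and scores for mismatched pairs pushed down, according to whichever MI objective Algorithm 1 specifies.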
C.2 Training and Evaluation Setups
LLM Finetuning. The learning rate is set to for CHiME-4 that is relatively small, and set to for relatively large datasets including VB-DEMAND, NOIZEUS, LS-FreeSound and RATS. The batch size is set to 4 with 8 gradient accumulation iterations (i.e., an effective batch size of 32). We train for 2 epochs with the AdamW optimizer (Loshchilov & Hutter, 2018), with weight decay set to 0.02 and warmup steps set to 20% of one epoch's steps. In addition, MINE is updated using a separate AdamW optimizer whose learning rate is 10% of that used for LLM tuning, with all other configurations kept the same. The hyper-parameter in Algorithm 1 is set to 0.5. We use 1 NVIDIA A40 GPU for model training, which takes 1.5 hours for CHiME-4, 2.0 hours for VB-DEMAND, 1.6 hours for NOIZEUS, 4.5 hours for LS-FreeSound, and 3.8 hours for RATS.
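As a worked example of the batch and warmup settings, the short sketch below derives step counts for one dataset. The helper name `schedule` is hypothetical, and it assumes that "steps" are counted per optimizer update (after gradient accumulation) rather than per micro-batch, which the text does not specify.

```python
import math


def schedule(num_pairs, batch_size=4, accum_iters=8, epochs=2, warmup_frac=0.2):
    """Derive step counts from the reported batch settings."""
    effective_batch = batch_size * accum_iters
    # Optimizer updates per epoch (assumption: one step = one update).
    steps_per_epoch = math.ceil(num_pairs / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
        "warmup_steps": round(warmup_frac * steps_per_epoch),
    }


# CHiME-4 has 9,600 training pairs (tr05-real in Table 6).
chime4 = schedule(num_pairs=9600)
```

Under these assumptions CHiME-4 yields 300 updates per epoch, 600 in total, with the first 60 used for warmup.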
Instruction-following Finetuning. As presented in Fig. 1, we leverage an instruction-following finetuning strategy for GER, for which we design the following instruction template:
``Below is the best-hypotheses transcribed from speech recognition system. Please try to revise it using the words which are only included into other-hypothesis, and write the response for the true transcription.### Best-hypothesis:{1-best hypothesis}### Other-hypothesis:{2∼N-best hypotheses}### Response:''
We find that different instruction templates have a slight impact on the final GER performance, which remains an open question for further discussion. In particular, we design some constraints (e.g., only use the words inside the N-best hypotheses list for error correction) to control the quality of the response and avoid potential LLM hallucinations (Feldman et al., 2023).
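Filling the template from an N-best list can be sketched as follows. The template wording is copied from the text above; the placeholder names `best` and `others`, the function name, and the ". " separator used to join the remaining hypotheses are assumptions, since the joining format is not specified here.

```python
from typing import List

TEMPLATE = (
    "Below is the best-hypotheses transcribed from speech recognition system. "
    "Please try to revise it using the words which are only included into "
    "other-hypothesis, and write the response for the true transcription."
    "### Best-hypothesis:{best}### Other-hypothesis:{others}### Response:"
)


def build_prompt(n_best: List[str]) -> str:
    """Fill the template with the 1-best and the remaining hypotheses."""
    best, others = n_best[0], n_best[1:]
    # Joining with ". " is an assumption; the separator is not specified.
    return TEMPLATE.format(best=best, others=". ".join(others))


prompt = build_prompt(["the cat sat", "the cat sad", "a cat sat"])
```

During finetuning the ground-truth transcription would follow the final "### Response:" marker as the supervised target.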
[Table 8: results with LLaMA-7b (see §D.1). Columns: Test Set, Baseline, LM, GER, + Audio Denoising, RobustGER (ours), Oracle. Rows: CHiME-4 (test-real, test-simu, dev-real, dev-simu, avg.); VB-DEMAND (baby-cry, helicopter, crowd-party, avg.); NOIZEUS (babble, car, station, train, street, airport, exhibition, restaurant, avg.); LS-FreeSound (metro, car, traffic, cafe, babble, ac/vacuum, avg.); RATS (test).]
Response Generation. In the generation stage, we adopt a temperature of 0.2 and top-1 sampling, i.e., greedy search. We observe an over-confidence phenomenon in our experiments (i.e., the output probability distribution for each decision is close to one-hot), which results in similar performance for different values of $k$ in top-$k$ sampling. Therefore, we select top-1 sampling for higher decoding efficiency.
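The equivalence between top-1 sampling and greedy search can be checked with a small NumPy sketch of temperature-scaled top-k sampling. The function name and the toy logits are hypothetical; this is not the decoding code of any particular LLM library.

```python
import numpy as np


def sample_top_k(logits, k=1, temperature=0.2, rng=None):
    """Sample a token id from the k most likely tokens after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(scaled)[-k:]                 # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


logits = [1.2, 3.4, 0.3, 2.9]
token = sample_top_k(logits, k=1)                 # only one candidate remains
```

With `k=1` the candidate set contains only the arg-max token, so the temperature has no effect on the result. A low temperature such as 0.2 also sharpens the distribution for larger `k`, which is consistent with the near one-hot behavior described above.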
LM Rescoring Baseline. For the LM rescoring baseline in Table 1, we use a Transformer-based LM for typical rescoring, which is trained on the text transcriptions of each RobustHP subset using the ESPnet toolkit (https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1; Watanabe et al., 2018). The LM contains 16 Transformer layers with 8 heads and 512 attention units, and it is trained for 25 epochs with the Adam optimizer (Kingma & Ba, 2014). The learning rate is set to with 25,000 warm-up steps.
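Typical N-best rescoring can be sketched as picking the hypothesis with the best interpolated score. The log-linear combination and the weight of 0.5 below are common choices assumed for illustration; the exact scoring rule and weight of the baseline are not stated here, and the toy LM is a hypothetical lookup table.

```python
from typing import Callable, List, Tuple


def rescore(n_best: List[Tuple[str, float]],
            lm_logprob: Callable[[str], float],
            lm_weight: float = 0.5) -> str:
    """Return the hypothesis with the best combined ASR + LM score.

    n_best: list of (text, asr_logprob); lm_logprob: maps text to a log-probability.
    """
    return max(n_best, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]


# Toy example: the LM strongly prefers the fluent hypothesis.
toy_lm = {"the cat sat": -2.0, "the cat sad": -6.0}
n_best = [("the cat sad", -1.0), ("the cat sat", -1.2)]
best = rescore(n_best, toy_lm.get, lm_weight=0.5)
```

Rescoring can only return one of the existing hypotheses, whereas GER generates a new transcription, which is why GER can in principle go below the reranking upper bound.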
In-context Learning Baseline. We implement an in-context learning baseline for the case study in Table 5, which is effective in making full use of the LLM's powerful reasoning ability and linguistic knowledge (Dong et al., 2022). In particular, we utilize ChatGPT to conduct the GER task using task-activated prompting (TAP) (Yang et al., 2023a): we first prompt ChatGPT to summarize what ASR and typical LM rescoring are, and then inform it of the definition of ASR generative error correction, followed by several examples that teach it how to perform this kind of error correction. With this background knowledge, we finally ask it to perform GER on our sample in the case study.
Details of t-SNE Visualization. Figs. 4 and 6 present the t-SNE visualization of the language and audio noise embeddings. The language embeddings are the outputs of the distillation tuner, computed on LS-FreeSound test samples. The audio embeddings are the encoder outputs of the Whisper ASR model, where the speech samples also come from the LS-FreeSound test set. In particular, for better visualization we employ Stable-Whisper (https://github.com/jianfch/stable-ts) to extract the speech segments of the same word ``for'' (around 5.7s in total from the LS-FreeSound test data), as the distance between different phonemes is much larger than that between different noise conditions.
[Table 9: results with Falcon-7b (see §D.1). Columns: Test Set, Baseline, LM, GER, + Audio Denoising, RobustGER (ours), Oracle. Rows: CHiME-4 (test-real, test-simu, dev-real, dev-simu, avg.); VB-DEMAND (baby-cry, helicopter, crowd-party, avg.); NOIZEUS (babble, car, station, train, street, airport, exhibition, restaurant, avg.); LS-FreeSound (metro, car, traffic, cafe, babble, ac/vacuum, avg.); RATS (test).]
[Table 10: results with LLaMA-2-13b (see §D.1). Columns: Test Set, Baseline, LM, GER, + Audio Denoising, RobustGER (ours), Oracle. Rows: CHiME-4 (test-real, test-simu, dev-real, dev-simu, avg.); VB-DEMAND (baby-cry, helicopter, crowd-party, avg.); NOIZEUS (babble, car, station, train, street, airport, exhibition, restaurant, avg.); LS-FreeSound (metro, car, traffic, cafe, babble, ac/vacuum, avg.); RATS (test).]
Appendix D Supplementary Experiments
D.1 Results on Different LLMs
Apart from LLaMA-2-7b, we also evaluate our proposed RobustGER approach on the popular LLaMA-7b and Falcon-7b models, as shown in Tables 8 and 9. To further investigate the effect of LLM size on RobustGER, we conduct extra experiments on LLaMA-2-13b in Table 10.
Similar to the results of LLaMA-2-7b in Table 1, our proposed RobustGER achieves consistent performance gains across various LLMs and testing conditions, which verifies its general effectiveness. On the other hand, there are some performance differences between LLMs. In particular, LLaMA-2-13b outperforms all the 7b LLMs due to its larger model capacity and stronger language generation ability. Among the 7b models, LLaMA-2-7b outperforms LLaMA-7b and Falcon-7b thanks to its larger-scale training data and longer context length.
[Table 11: results under different SNR levels (see §D.2). Columns: Noise Type, SNR (dB), Baseline, LM, GER, + Audio Denoising, RobustGER (ours), Oracle. Rows: Metro, Car, Traffic, Cafe, Babble and AC/Vacuum, each at 0, 5, 10, 15 and 20 dB plus avg.; Clean.]
[Table 12: results on clean test data (see §D.2). Columns: Test set, Baseline, LM, GER, + Audio Denoising, RobustGER (ours), Oracle. Rows: VB-DEMAND; LS-FreeSound.]
D.2 Results on Different SNRs
Table 11 reports more results under testing conditions with different SNRs. Similar to Table 2, we observe consistent performance gains of RobustGER over the vanilla GER and audio denoising baselines across noise levels ranging from 0 dB (quite noisy) to 20 dB (relatively clean). In addition, RobustGER surpasses the reranking upper bound in some testing scenarios, indicating its advantage over conventional LM rescoring methods.
Furthermore, we report error correction results on the clean test data of the VB-DEMAND and LS-FreeSound datasets, where RobustGER achieves significant relative WER reductions of 46.2% and 30.0%, respectively. This experimental evidence demonstrates the strong generality of RobustGER across ASR scenarios.
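For readers less familiar with the metric, the relative WER reduction quoted above can be sketched as follows. This is a minimal illustration and not code from the RobustGER repository: the function name `relative_wer_reduction` is hypothetical, and the example WER values are placeholders rather than numbers from our tables.

```python
def relative_wer_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - corrected) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - corrected_wer) / baseline_wer


# Hypothetical WERs in percent (placeholders, NOT taken from the paper's tables).
example = relative_wer_reduction(baseline_wer=10.0, corrected_wer=7.0)
print(f"relative WER reduction: {example:.1f}%")
```

A relative reduction is normalized by the baseline error, so the same absolute gain counts for more on an easier (lower-WER) test set such as clean speech.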
Test Set | | Baseline | GER | + Audio Denoising | + Language Denoising (LLaMA Emb.) | + Language Denoising (FastText) | + Language Denoising (SBERT) |
---|---|---|---|---|---|---|---|
CHiME-4 | test-real | ||||||
test-simu | |||||||
dev-real | |||||||
dev-simu | |||||||
avg. | |||||||
VB-DEMAND | baby-cry | ||||||
helicopter | |||||||
crowd-party | |||||||
avg. |
Test Set | | Baseline | GER | + Lang. Denoising | + Audio Noise Distillation (T-S learning) | + Audio Noise Distillation (Contra. learning) | + Audio Noise Distillation (MINE) |
---|---|---|---|---|---|---|---|
CHiME-4 | test-real | ||||||
test-simu | |||||||
dev-real | |||||||
dev-simu | |||||||
avg. | |||||||
VB-DEMAND | baby-cry | ||||||
helicopter | |||||||
crowd-party | |||||||
avg. |
D.3 Ablation Study of Language Embedding Extractor
Table 13 presents the ablation study of the proposed language-space noise embedding with different text embedding extractors. First, we use the input word-embedding layer of LLaMA-2-7b to extract the utterance- and token-level embeddings described in §4.2, which yields only minor gains over the audio denoising baseline, indicating that the LLaMA embedding is less discriminative for audio noise modeling. The supervised text classifier FastText (Grave et al., 2018) provides a better way to extract text embeddings for modeling N-best list diversity. Benefiting from the powerful global context modeling ability of the Transformer (Vaswani et al., 2017), SBERT (Reimers & Gurevych, 2019) performs best for language-space noise embedding extraction, as it represents both utterance- and token-level embeddings well, as shown in Table 3.
D.4 Ablation Study of Audio Noise Distillation
Table 14 compares different KD approaches for audio noise distillation. The first is teacher-student learning, which performs distillation through KL-divergence regularization between a trainable student and a frozen teacher, but it yields only minor gains. In comparison, contrastive learning achieves better results by introducing positive and negative samples to learn distinctiveness. However, it remains sub-optimal because of the large distance between the language and audio spaces: the anchor (language noise embedding) is far from both the positive (noisy audio embedding) and the negative (clean audio embedding) samples, which are relatively close to each other. To this end, the MINE approach we adopt introduces a neural network to estimate and maximize mutual information, which is more direct and effective in manipulating representations from different spaces for knowledge distillation. As a result, MINE achieves the best audio noise distillation performance.
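To make the mutual-information objective concrete, the sketch below estimates the Donsker-Varadhan lower bound that MINE maximizes, $I(X;Z) \geq \mathbb{E}_{p(x,z)}[T] - \log \mathbb{E}_{p(x)p(z)}[e^{T}]$. It is a toy illustration under loud assumptions, not our implementation: the neural critic is replaced by a hypothetical one-parameter critic $T(x,z)=w\,x\,z$ tuned by grid search, and scalar Gaussian variables stand in for the language and audio embeddings. All function and variable names are invented for this example.

```python
import math
import random


def dv_lower_bound(critic, joint_pairs, marginal_pairs):
    """Donsker-Varadhan estimate: E_joint[T] - log E_marginal[exp(T)]."""
    joint_term = sum(critic(x, z) for x, z in joint_pairs) / len(joint_pairs)
    exp_mean = sum(math.exp(critic(x, z)) for x, z in marginal_pairs) / len(marginal_pairs)
    return joint_term - math.log(exp_mean)


def best_bound(xs, zs, rng, weights):
    """Maximize the bound over a one-parameter critic T(x, z) = w * x * z.

    Assumption: grid search over w replaces the gradient-trained network of MINE.
    """
    joint = list(zip(xs, zs))
    shuffled = list(zs)
    rng.shuffle(shuffled)  # breaking the pairing approximates samples from p(x)p(z)
    marginal = list(zip(xs, shuffled))
    return max(
        dv_lower_bound(lambda x, z, w=w: w * x * z, joint, marginal)
        for w in weights
    )


rng = random.Random(0)
weights = [i / 20 for i in range(11)]  # candidate critic weights in [0.0, 0.5]
xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # toy "audio" variable
zs_dependent = [x + rng.gauss(0.0, 0.5) for x in xs]  # shares information with xs
zs_independent = [rng.gauss(0.0, 1.0) for _ in xs]    # shares none

print("dependent pair bound:  ", best_bound(xs, zs_dependent, rng, weights))
print("independent pair bound:", best_bound(xs, zs_independent, rng, weights))
```

In this toy setting the bound is clearly larger for the dependent pair than for the independent one. In actual MINE training, the critic is a neural network and the bound is maximized by gradient ascent rather than by grid search.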
Noise Type | SNR (dB) | N-best Hypotheses | Acoustic Score | WER (%) |
Babble | 0 | i pray for them but that is not the same as i pray for sam | ||
i pray for them but that is not the same as i pray for science | ||||
i pray for them but that is not the same as if i prayed for sam | ||||
i pray for them but that is not the same as i pray for sons | ||||
i pray for them but that is not the same as if i pray for sam | ||||
10 | i pray for you but that is not the same as if you prayed yourself | |||
i pray for you but that is not the same as if you prayed yourself | ||||
i pray for you but that is not the same as if you pray yourself | ||||
i pray for you but that is not the same as if you pray for yourself | ||||
i pray for you but that is not the same as if you prayed for yourself | ||||
AC | 0 | i pray for you but that is not the same as if you prayed yourself | ||
i pray for you but that is not the same as if you pray yourself | ||||
i pray for you but that is not the same as if you pray for yourself | ||||
i would pray for you but that is not the same as if you prayed yourself | ||||
i pray for you but that is not the same as if you prayed for yourself | ||||
10 | i pray for you but that is not the same as if you prayed yourself | |||
i pray for you but that is not the same as if you prayed yourself | ||||
i prayed for you but that is not the same as if you prayed yourself | ||||
i prayed for you but that is not the same as if you prayed yourself | ||||
i prayed for you but that is not the same as if you prayed yourself | ||||
Clean | i pray for you but that is not the same as if you prayed yourself | |||
i pray for you but that is not the same as if you prayed yourself | ||||
i pray for you but that is not the same as if you prayed yourself | ||||
i pray for you but that is not the same as if you prayed yourself | ||||
i pray for you but that is not the same as if you prayed yourself | ||||
Ground Truth | i pray for you but that is not the same as if you prayed yourself | - | - |
![Refer to caption](x6.png)
D.5 Relationship between Noisy Speech and N-best List Diversity
As introduced in §1, the insight behind using a language-space noise embedding to represent audio noise is the relationship between the noise conditions of the source speech and the diversity of the N-best list decoded by the ASR model: the worse the noise conditions (a more challenging noise type or a lower SNR), the higher the uncertainty of ASR beam search decoding, which results in more diverse N-best hypotheses. To verify this insight, Table 15 presents the N-best hypotheses of one speech sample under different noise conditions. For Babble noise, 0 dB yields higher decoding uncertainty (i.e., lower acoustic scores) than 10 dB, which results in more diverse N-best hypotheses and a worse 1-best WER, i.e., more language noise. A similar phenomenon can be observed under the AC noise condition. On the other hand, Table 11 shows that Babble noise yields worse ASR results than AC noise at the same SNR level, which means Babble is a more challenging noise type. As a result, Babble_0dB produces a more diverse N-best list than AC_0dB, and the same holds for Babble_10dB and AC_10dB. In particular, the highly intelligible clean speech yields no N-best diversity. Fig. 6 visualizes the language noise that originates from different audio noise, where the distances between clusters well represent the noise levels of the source speech.
In summary, the relationship between the audio noise in the source speech and the language noise in the decoded N-best list inspires us to propose language-space denoising. Fortunately, the powerful generation ability of LLMs promotes the success of this research idea.
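The relationship described above can be checked on the hypotheses listed in Table 15 with a simple proxy. The sketch below is an illustration only: it measures N-best diversity as the mean pairwise word-level edit distance, which is a stand-in we choose here for readability and not the SBERT-based language-space noise embedding of RobustGER. The helper names are hypothetical.

```python
from itertools import combinations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent of a hypothesis against a reference."""
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def nbest_diversity(nbest):
    """Mean pairwise edit distance (assumed proxy for N-best list diversity)."""
    pairs = list(combinations(nbest, 2))
    return sum(edit_distance(a.split(), b.split()) for a, b in pairs) / len(pairs)


PREFIX_THEM = "i pray for them but that is not the same as "
PREFIX_YOU = "i pray for you but that is not the same as "
GROUND_TRUTH = PREFIX_YOU + "if you prayed yourself"

# N-best lists copied from Table 15 (Babble at 0 dB and 10 dB, and clean speech).
BABBLE_0DB = [PREFIX_THEM + t for t in (
    "i pray for sam", "i pray for science", "if i prayed for sam",
    "i pray for sons", "if i pray for sam")]
BABBLE_10DB = [PREFIX_YOU + t for t in (
    "if you prayed yourself", "if you prayed yourself", "if you pray yourself",
    "if you pray for yourself", "if you prayed for yourself")]
CLEAN = [GROUND_TRUTH] * 5

for name, nbest in (("Babble 0 dB", BABBLE_0DB), ("Babble 10 dB", BABBLE_10DB), ("Clean", CLEAN)):
    print(name, "diversity:", nbest_diversity(nbest), "1-best WER:", wer(GROUND_TRUTH, nbest[0]))
```

Under this proxy, the ordering matches the discussion above: the 0 dB Babble list is more diverse and has a worse 1-best WER than the 10 dB list, and the clean list shows no diversity.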
Appendix E Limitations
Though effective in improving noisy ASR performance, the proposed RobustGER still has some limitations.
• Table 16 presents a failure case on the CHiME-4 dev-real set. There is one error in the N-best hypotheses, i.e., the word ``Miss'', which should be ``Ms'' in the ground truth. The GER baseline corrects this error while RobustGER fails. A possible reason is that ``Ms'' (/mɪz/) and ``Miss'' (/mɪs/) sound similar, especially under noisy conditions, so GER cannot distinguish them and relies on the LLM to decide based on context. Thanks to its rich linguistic knowledge and powerful reasoning ability, the LLM enables GER to generate the correct word ``Ms'', which is more appropriate than ``Miss'' in this context. With the proposed language-space denoising, on the other hand, RobustGER perceives the subtle difference between the two pronunciations but finds the word more likely to be ``Miss'' (e.g., the speaker’s pronunciation may be non-standard). Such information misleads the LLM into generating the wrong word. This is therefore a trade-off between contextual information and denoising when LLMs generate the transcription: 1) when both homophones suit the context, LLMs should denoise carefully to find the correct word (see Table 5); 2) when one homophone is clearly more suitable to the context than the other, LLMs may not need denoising, as it could provide misleading information. We believe this is a promising research direction for future work on GER.
• We observe from the main results in Table 1 that both GER and RobustGER achieve significantly larger improvements on the CHiME-4 dataset than on the other datasets. This phenomenon was also observed and analyzed in the original GER benchmark (Chen et al., 2023b): the CHiME-4 transcriptions contain many financial terms that are relatively easy for LLMs to correct. Therefore, future work may need an analysis of error types on CHiME-4 to understand how RobustGER works there.
• After our initial draft was released on OpenReview in September 2023, we learned of recent developments in post-recognition text modeling, as well as LLM-based efforts in audio understanding (Gong et al., 2023a; b; Wu et al., 2023b) and speaker diarization (Park et al., 2023; Wang et al., 2024). We hope to align the efforts of different research groups to enable more robust and resilient text modeling evaluations for various speech and audio processing tasks in the future, as part of a collaborative community effort.
Method | Utterance | WER (%) |
N-best List | miss amsterdam declined to comment | |
miss amsterdam declined to comment | ||
ms amsterdam declined to comment | ||
miss amsterdam declined to comment | ||
miss amsterdam decline to comment | ||
GER | ms amsterdam declined to comment | |
RobustGER | miss amsterdam declined to comment | |
Ground Truth | ms amsterdam declined to comment | - |
Clean vs. | ac | babble | cafe | car | metro | traffic | avg. |
---|---|---|---|---|---|---|---|
Language Noise Emb. | |||||||
+ Audio Distillation | |||||||