\interspeechcameraready\name

[affiliation=1]SalahZaiem \name[affiliation=2]TitouanParcollet \name[affiliation=1]SlimEssid

Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations

Abstract

Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-ground solutions, conjecturing that reducing the forgetting of the self-supervised task during the downstream fine-tuning leads to better generalization. To prove this, focusing on speech recognition, we benchmark different continual-learning approaches during fine-tuning and show that they improve both in-domain and out-of-domain generalization abilities. Relative performance gains reach 15.7%percent15.715.7\%15.7 % and 22.5%percent22.522.5\%22.5 % with XLSR used as the encoder on two English and Danish speech recognition tasks. Further probing experiments show that these gains are indeed linked to less forgetting.

keywords:
Self-supervised learning (SSL), continual learning, speech recognition.

1 Introduction

Self-supervised representations are often used in low-resource scenarios where downstream data can consist of only a handful of hours of annotated audio [1, 2]. During evaluation and benchmarking, self-supervised encoders are generally kept frozen[3, 4]. However, in common uses, the encoder weights are also fine-tuned. On the one hand, freezing the self-supervised representations during downstream training makes the SSL backbone a mere feature extractor. In this case, to reach reasonable performance, the downstream head may need to be more complex leading to costly inferences [5]. On the other hand, full fine-tuning of the SSL encoder reduces the pretraining to a superior network initialization.

In this work, we postulate that fine-tuning the whole network weights hurts the generalization abilities of the final obtained model, because the model may “forget” what has been learned in the pretraining phase. While this has been proven in text-related applications [6, 7], it has not been explicitly studied in previous speech-processing works. The main argument for the postulate is that models generally learn to solve the self-supervision on massive unlabeled datasets. This large data diversity makes these models robust to distributional shifts and explains in large part their generalization abilities across different domains [8] and should thus be kept after the fine-tuning.

Forgetting issues have been tackled in the Continual Learning (CL) literature. CL is a machine learning paradigm that focuses on training models to acquire new knowledge, adapting to new tasks and data, without losing prior abilities [9]. This work explores the use of continual-learning methods during downstream training, kee**, even after fine-tuning, pretraining task abilities. Continual learning approaches have been explored lately for speech processing towards including, within a given model scope, new languages [10, 11], new accents [12, 13] or new speakers [14] without losing previous abilities. For self-supervision purposes, it has been used to further train an SSL model to include new domains where the learned representations can be efficiently used for downstream training [15]. However, these works never explored CL usage during fine-tuning.

A close line of work is parameter-efficient fine-tuning (PEFT) [16]. While reducing the number of parameters updated is done mainly for the sake of efficiency in the case of large pretrained models, it also leads to less forgetting through freezing large parts of the network [17]. Those methods are widely adopted in the natural language processing and computer vision communities due to the large size of the models [18]. The PEFT speech-related literature is more scarce even if positive results were obtained for child-directed speech [19] or emotion recognition [20]. One close effort has tried various adapter-based methods for self-supervised models on a group of speech tasks [21]. However, it focuses on training efficiency, while we show its impact on performance, especially on out-of-distribtion (OOD) testing samples.

Finally, it is important to note that in our case, while we use methods inspired by continual learning, downstream task performance, in and out-of-domain, is the only objective. We do not evaluate the methods proposed on their forgetting reduction capacity as in the classic CL literature. The link between forgetting and performance is only probed in a second time. The contributions of this work are, thus, three-fold:

  1. 1.

    We explore several continual-learning-based approaches for speech SSL fine-tuning showing substantial performance both on in-domain and out-of-domain testing samples. In particular, as far as we know, we are the first to explore replay options in this context. (Sections 2 and 3).

  2. 2.

    We highlight the link between the performance gain and the non-forgetting of the self-supervised task by probing the forgetting of the best-performing methods. (Section 4).

  3. 3.

    We release the SpeechBrain-based [22] 111github.com/salah-zaiem/CL_based_FTcode to enable reproduction of our work and further investigations.

2 Methods

This section introduces the downstream fine-tuning methods we compare in this work. After describing a few classic baseline approaches, it moves to continual-learning-inspired ones, divided into two groups: freezing and replay-based approaches.

2.1 Baselines

Four baselines, inspired by common practice, are considered in this work. The first one is the vanilla full fine-tuning of the network with the downstream task loss. For the second baseline, the convolutional front-end, often called ”acoustic feature extractor” is frozen during fine-tuning. It is more common than full fine-tuning in the speech self-supervision literature as masking in the self-supervised pretraining happens generally after the convolutional front-end [23].

The third baseline targets a better initialization of the downstream head by kee** the encoder frozen during the first steps of downstream training [24]. In a second time, after the initialization of the downstream head, the weights of the encoder are unfrozen, and the whole model is fine-tuned. This method will be called “two-phased” in the following and has been shown, in computer vision, to lead to better generalization [24]. The final baseline consists in freezing the SSL encoder during the whole downstream finetuning. In this case and following common practices in frozen SSL benchmarking [4, 5], the layer input to the downstream head is a weighted sum of the encoder layers, with weights learned during fine-tuning.

2.2 Freezing-Based Approaches

This section presents a group of tested fine-tuning methods grouped under the “freezing-based” title as they consist in freezing totally or partially a group of weights learned in the pretraining phase. Among these, we present and test three methods.

Adapters. The first one uses adapters within the encoder layers [13]. Adapters are lightweight modules intervening after the dense layers that come after self-attention. Precisely, instead of feeding to the next encoder layer the output of the feed-forward layer following the attention, this output is passed through the adapter and summed to itself as in residual approaches. Only the weights of the adapters are changed during the fine-tuning.

LoRa. Second, following successful trends in natural language processing and computer vision, we test Low-Rank [18] (LoRa) fine-tuning. It consists of freezing the pretrained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture, reducing the number of trainable parameters for downstream tasks. Precisely, we replace the feed-forward layers after the self-attention mechanism with LoRA layers. The initial matrix W0d×ksubscript𝑊0superscript𝑑𝑘W_{0}\in\mathcal{R}^{d\times k}italic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ caligraphic_R start_POSTSUPERSCRIPT italic_d × italic_k end_POSTSUPERSCRIPT is replaced with a low-rank decomposition W0+ΔWsubscript𝑊0Δ𝑊W_{0}+\Delta Witalic_W start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + roman_Δ italic_W with ΔW=BAΔ𝑊𝐵𝐴\Delta W=BAroman_Δ italic_W = italic_B italic_A where Bd×r𝐵superscript𝑑𝑟B\in\mathcal{R}^{d\times r}italic_B ∈ caligraphic_R start_POSTSUPERSCRIPT italic_d × italic_r end_POSTSUPERSCRIPT and Ar×k𝐴superscript𝑟𝑘A\in\mathcal{R}^{r\times k}italic_A ∈ caligraphic_R start_POSTSUPERSCRIPT italic_r × italic_k end_POSTSUPERSCRIPT with r𝑟ritalic_r the rank of the low-rank matrix and d,k𝑑𝑘d,kitalic_d , italic_k the dimensions of latent representations.

Elastic Weight Consolidation. Finally, we explore using Elastic Weight Consolidation (EWC) [25] during fine-tuning. EWC fine-tuning implies an additional loss, during downstream training, that forces the weights of the final model to be closer to those at the end of the pretraining phase. For every updated parameter, the distance to the initial phase is penalized by the corresponding Fisher information value. The loss becomes:

(θ)=DS(θ)+iλ2Fi(θiθi)2𝜃subscript𝐷𝑆𝜃subscript𝑖𝜆2subscript𝐹𝑖superscriptsubscript𝜃𝑖superscriptsubscript𝜃𝑖2\mathcal{L}(\theta)=\mathcal{L}_{DS}(\theta)+\sum_{i}\frac{\lambda}{2}F_{i}% \left(\theta_{i}-\theta_{i}^{*}\right)^{2}caligraphic_L ( italic_θ ) = caligraphic_L start_POSTSUBSCRIPT italic_D italic_S end_POSTSUBSCRIPT ( italic_θ ) + ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT divide start_ARG italic_λ end_ARG start_ARG 2 end_ARG italic_F start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (1)

with θ𝜃\thetaitalic_θ the parameters of the SSL model, DSsubscript𝐷𝑆\mathcal{L}_{DS}caligraphic_L start_POSTSUBSCRIPT italic_D italic_S end_POSTSUBSCRIPT the downstream loss, θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT the frozen SSL model weights after pretraining, F𝐹Fitalic_F the Fisher information matrix and λ𝜆\lambdaitalic_λ a weighting hyper-parameter. The Fisher information matrix [26] captures how sensible is the pretraining loss to a given parameter. Thus, the loss above penalizes the movement of the most important parameters to the self-supervision task, leading to less forgetting.

2.3 Replay-Based Approaches

We also explore replay methods, often called “experience replay” [27] in the continual learning literature, during the fine-tuning of self-supervised representations. Replaying the pretraining task explicitly enforces non-forgetting through optimizing for simultaneously low SSL and downstream losses. We noticed that the replay loss should have a lower weight than the downstream one for optimal performance. So, for faster training, instead of weighting the replay loss, at every training step, we sample a random variable YU(0,1)similar-to𝑌𝑈01Y\sim U(0,1)italic_Y ∼ italic_U ( 0 , 1 ) and load a replay batch with a probability pRsubscript𝑝𝑅p_{R}italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT to perform the replay task. The fine-tuning loss becomes :

(θ)=DS,XDS(θ)+𝟙(Y<pR)SSL,XR(θ)𝜃subscript𝐷𝑆subscript𝑋𝐷𝑆𝜃double-struck-𝟙𝑌subscript𝑝𝑅subscript𝑆𝑆𝐿subscript𝑋𝑅𝜃\mathcal{L}(\theta)=\mathcal{L}_{DS,X_{DS}}(\theta)+\mathbb{1}(Y<p_{R})% \mathcal{L}_{SSL,X_{R}}(\theta)caligraphic_L ( italic_θ ) = caligraphic_L start_POSTSUBSCRIPT italic_D italic_S , italic_X start_POSTSUBSCRIPT italic_D italic_S end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ ) + blackboard_𝟙 ( italic_Y < italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT ) caligraphic_L start_POSTSUBSCRIPT italic_S italic_S italic_L , italic_X start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_θ ) (2)

with DSsubscript𝐷𝑆\mathcal{L}_{DS}caligraphic_L start_POSTSUBSCRIPT italic_D italic_S end_POSTSUBSCRIPT the downstream ASR loss, SSLsubscript𝑆𝑆𝐿\mathcal{L}_{SSL}caligraphic_L start_POSTSUBSCRIPT italic_S italic_S italic_L end_POSTSUBSCRIPT the self-supervision loss and pRsubscript𝑝𝑅p_{R}italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT the replay frequency. Every loss is associated in the formula to the source dataset of the batch it is performed on, with XDSsubscript𝑋𝐷𝑆X_{DS}italic_X start_POSTSUBSCRIPT italic_D italic_S end_POSTSUBSCRIPT the downstream dataset and XRsubscript𝑋𝑅X_{R}italic_X start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT the replay one.

Method English Training Danish Training
GS Test LS test-clean LS test-other Scottish Welsh CV Mean English NST Test CV Mean Danish
Data2Vec Base Baselines
Frozen 33.38 17 22.81 38.05 33.22 56.12 33.43 70.35 83.57 76.96
Full FT 26.92 9.83 17.47 26.9 22.32 53.4 26.14 13.75 36.57 25.16
Fixed CNN 26.67 10.01 16.94 25.52 22.65 49.98 25.30 13.8 34.38 24.09
Two-Phase 26.67 10.14 17.71 26.28 23.65 49.1 25.59 14.63 36.56 25.60
Freezing-Based
LoRa 25.74 9.27 15.73 25.18 21.88 50.81 24.76 12.89 31.13 22.01
EWC 25.57 9.4 16.3 24.97 21.08 50.11 24.57 12.95 31.70 22.33
Adapters 30.62 12.81 19.72 35.16 30.8 56.42 30.92 45.48 62.43 53.96
Replay
LS-Replay 26.07 9.71 16.34 25.14 20.37 48.35 24.33 12.93 32.36 22.64
Auto-Replay 26.25 9.54 17.16 25.8 22.91 50.48 25.36 13.14 35.93 24.54
XLSR-53 Baselines
Frozen >>> 100 >>>100 >>>100 >>>100 >>>100 >>>100 N/A >>>100 >>>100 N/A
Full FT 28.85 11.89 24.43 32.35 28.42 60.69 31.10 10.99 30.41 20.07
Fixed CNN 28.98 12 24.35 33.49 29.42 58.88 31.10 10.8 27.87 19.34
Two-Phase 27.42 10.97 21.66 30.23 25.08 56.19 28.59 11.21 28.94 20.07
Freezing-Based
LoRa 26.68 10.73 19.79 28.61 24.02 50.83 26.78 10.37 24.7 17.54
EWC 27.21 10.55 20.14 29.58 27.02 51.12 27.60 10.35 24.44 17.40
Adapters 28.8 12.76 20.3 29.05 26.36 50.61 27.98 18.85 33.34 26.10
Replay
LS-Replay 27.54 10.85 20.21 29.15 27.53 53.98 28.21 9.29 23.56 16.43
Auto-Replay 28.6 11.53 22.75 31.08 28.52 53.17 29.28 11.22 29.48 20.35
Table 1: WER Results on different test sets using two different SSL backbone encoders; Data2Vec Base and Wav2Vec2 XLSR. The English fine-tuning is performed on the GigaSpeech “XS” subset and the Danish one on 50505050 hours of the NST dataset.

3 Experiments and Results

This section outlines first the experimental details from dataset choices to used hyperparameters and configurations. It, then, presents and describes the obtained results.

3.1 Datasets

The selected downstream sets are of reduced sizes as this work explores fine-tuning options in low-resource scenarios. We will evaluate our methods in two languages, English and Danish. For the English sets, GigaSpeech [28] XS subset (10 hours) will be used for training instead of LibriSpeech [29] as the latter is in the pretraining sets, prohibiting proper forgetting considerations. The testing sets include the GigaSpeech test set, LibriSpeech test splits (test-clean and test-other), two datasets of Scottish and Welsh English accents [30] and CommonVoice 14.0 English [31] test set. The last three sets can be seen as the OOD testing samples as they present different accents and noise conditions. One may consider that OOD testing samples only underline a poor alignment between training and testing data. We argue that, in low-resource scenarios, with generally poor speaker and noise diversity, OOD use cases are almost inevitable, justifying our special interest in OOD performance.

For Danish, we use for training the NST Danish ASR Database 222nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/. It consists of read speech samples recorded in very similar conditions, thus enabling large possibilities for out-of-domain testing. 50505050 hours of the dataset, from multiple speakers, are randomly selected for training, 5555 for validation and 10101010 for in-domain testing. The exact splits are released within the code repository. The CommonVoice 14.0 Danish validation and testing splits are concatenated and used for OOD testing.

3.2 Self-Supervised Models

Two self-supervised models are considered, Data2Vec Base [32] and XLSR-53 [33]. They offer variability in network size (90M parameters for the former and 317M for the latter), pretraining dataset diversity and size, and finally training loss and methods. Data2Vec Base is only trained on the LibriSpeech training splits, grou** 960960960960 hours of English read speech, while XLSR-53 is trained on a total of 56k hours of speech data covering 53 languages.

3.3 Methods Parameters

This section gives the training details for the different continual-learning-based fine-tuning methods proposed in this work. Baselines. The only hyperparameter for the baselines concerns the length of the freezing phase in the “two-phased” approach, we fix it to 3 epochs.

Freezing Based. We use the LoRaLib toolkit [18] to replace the feed-forward layers following the transformer with a LoRa layer with rank r𝑟ritalic_r as described in Section 2. We chose r=16𝑟16r=16italic_r = 16 as in previous works on PEFT [20]. For adapter architectures, we follow previous works in ASR [34], with a bottleneck linear layer followed by an upsampling one. Finally, applying Elastic Weight Consolidation requires two choices. First, we fix the hyper-parameter controlling the distance to the original model loss (see Equation 1) to λ=50𝜆50\lambda=50italic_λ = 50. The Fisher information values are estimated on the LibriSpeech 10h split [35].

Replay Based. During the replay-based experiments, the pretraining tasks, masked latent prediction for Data2Vec, and contrastive predictive coding for XLSR are performed along with the ASR downstream one, as described in Section 2. During the first epoch of fine-tuning, no replay is done as it has been shown to lead to more stable fine-tunings. In the next epochs, fixing the replay probability to pR=0.25subscript𝑝𝑅0.25p_{R}=0.25italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT = 0.25 led to the best results. The hyper-parameters of the replay task, mainly controlling the mask creation, are kept similar to the default ones used for the pretraining. Finally, replay requires the choice of a replay dataset. In the following, we will call “auto-replay” experiments where the fine-tuning dataset is also used for the replay episodes. In a second experiment, either for English or Danish fine-tunings, replay batches will be sampled from LibriSpeech train splits, as they are included in the training sets of Data2vec Base and XLSR-53. We call this experiment “LS-replay”.

Refer to caption
Figure 1: Effect of different hyper-parameters on the final performance on Danish in-domain (ID, left y-axis) and out-of-domain (OOD, right y-axis) test sets, for three different techniques (LoRa, EWC, and LS-Replay), with XLSR backbone. While LoRa seems quite robust to changes in the main hyperparameter, always remaining under the baseline, other approaches require careful tuning. In the second and third plots, the fine-tuning baseline is shown for x=0𝑥0x=0italic_x = 0, while it is shown with horizontal dashed lines for the LoRa plot.

3.4 Speech Recognition Settings

Two fully connected layers, with a hidden size of 1,02410241,0241 , 024 map each frame vector to one of the considered characters. The whole model is fine-tuned using the Connectionist Temporal Classification (CTC) loss as the downstream loss. During inference, greedy decoding is applied to the CTC probability outputs without any language-model-based re-scoring following the SpeechBrain recipe [22]. More details are available in the accompanying GitHub repository.

Refer to caption
Figure 2: Evolution of the self-supervision task loss for 4 considered techniques on two English test sets with Data2Vec backbone. The best-performing approaches on the ASR task are the ones best-performing at the SSL task after the fine-tuning.

3.5 Results

Table 1 shows the Word Error Rates (WER) obtained in the English and Danish experiments, with Data2Vec for the upper part and XLSR in the bottom part, as the backbone self-supervised representation model. Results for the English training, i.e. the training performed on GigaSpeech XS, are shown on the five test sets described in Section 3.1, while results for the Danish one are shown on two test sets. Every number shown is the mean of three runs with three different random seeds.

Baselines. As expected, the frozen model leads to poor performance. It is the worst-performing approach for both languages and with both Data2Vec or XLSR. Even worse, the model is not able to fit with XLSR with frozen features. The two classic baselines, freezing the convolutional front-end and the two-phased training, seem to perform better than the full fine-tuning baseline, especially for out-of-domain samples. We can see for instances in Table 1 an absolute gain of 3.5%percent3.53.5\%3.5 % WER and 2.3%percent2.32.3\%2.3 % on CommonVoice English and Danish with the “fixed CNN” approach compared to the full fine-tuning approach.

Freezing-based. When considering the lower parts of the two blocks, presenting the alternative fine-tuning approaches results, we can see that, except for the failing “adapters” approach, all the methods lead to better performances, both for in-domain and out-of-domain testing cases. This is visible from the numbers in bold in the table, as for every test set, the best performance is systematically obtained from one of the proposed alternatives. For instance, Low-Rank fine-tuning, while also being more efficient during training, achieves a mean error rate 7.0%percent7.07.0\%7.0 % lower with Data2Vec and even 14.2%percent14.214.2\%14.2 % lower with Wav2Vec2 XLSR. In a few settings, we failed to achieve reasonable performance with adapters and leave exploration of better adapter options for future works.

Replay-based. The replay-based approaches show two rows, “LS-replay” and “Auto-replay”, as described in Section 3.3, depending on the replay dataset, either LibriSpeech (LS) or the fine-tuning set. In all our settings, with both SSL backbones and on both target languages, replaying LibriSpeech samples, instead of target ones, leads to lower WERs. “LS-replay” is even the best overall performing approach in two cases, Scottish and Welsh accented samples with Data2Vec Base and all Danish test sets with XLSR. The second case is surprising as the downstream and replay data are in different languages.

OOD Generalization. The two CommonVoice (CV) columns allow us to have a proper look at out-of-domain generalization. CV is a crowd-collected dataset showing various accents and recording conditions. This explains in part the high WER values in these columns. Compared to the full fine-tuning baseline, different freezing or replay-based approaches, allow a relative gain in performance that can reach 9.4%percent9.49.4\%9.4 % and 14.8%percent14.814.8\%14.8 % with Data2Vec Base, respectively for English and Danish. Relative gains even reach 15.7%percent15.715.7\%15.7 % and 22.5%percent22.522.5\%22.5 % for Wav2Vec2 XLSR.

4 Analysis and Discussion

This section examines, first, the link between gains in performance and pretraining forgetting. Second, it discusses the sensitivity to hyperparameters of the proposed approaches.

4.1 Probing the Forgetting

To diagnose the link between the performance gains and non-forgetting of the self-supervision part, a checkpoint of the model at every epoch of fine-tuning is saved for probing. We compute, on a selected test set, the Data2Vec self-supervised task loss obtained with the fine-tuned model, at every saved model checkpoint. Data2Vec is trained to predict the latent representations of masked speech parts. During the probing, we mask parts of the input after the convolutional front-end, and see whether the model recovers the latent representations of a non-masked audio. The self-supervision loss is computed on two English test sets, the in-domain test split of GigaSpeech and the OOD CV.

We choose to show the three best-performing approaches, LoRa, EWC, and LS-Replay, along with the best baseline, the two-phased fine-tuning in Figure 2. The two-phased baseline shows an outlying behavior with a loss value twice as high as the “LS-replay” and 6666 to 7777 times as high as for LoRa and EWC, after the frozen start of 3 epochs. This probing experiment seems to confirm the starting postulate with less forgetting being correlated with higher performance.

4.2 Sensitivty to Hyperparameters

As discussed in Sections 2 and 3.3, the presented fine-tuning approaches introduce various hyperparameters and choices. We highlight their influence by reporting the results of a group of experiments related to the tuning of these hyperparameters for the best-performing set of techniques. We consider for low-rank fine-tuning, EWC and “LS-replay” the most impacting hyperparameter: respectively, the rank r𝑟ritalic_r of the LoRa layers, the λ𝜆\lambdaitalic_λ parameter controlling the weight of the distance penalization in EWC and the frequency pRsubscript𝑝𝑅p_{R}italic_p start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT of replay episodes during fine-tuning. We show the results with different values of these hyperparameters on the in-domain (NST) and out-of-domain (CommonVoice) testing samples for the Danish training performed with XLSR-53 as the SSL backbone in Figure 1.

For LoRA, the final performance is not severely impacted by reasonable changes in the main hyperparameter, the rank of the Lora layer. However, this is not the case for the two other techniques, as shown clearly with the “V” shapes of the plots in the second and third columns of Figure 1. With inappropriate values, word error rates are higher than the full FT baseline.

5 Conclusion

This work tests continual-learning-inspired fine-tuning approaches for self-supervision-based speech recognition. Results show that LoRA fine-tuning, EWC, and replay allow substantial gains compared to the full fine-tuning baseline, reaching 14.8%percent14.814.8\%14.8 % and 22.5%percent22.522.5\%22.5 % on Danish ASR, with two different SSL encoders. These gains are correlated with less forgetting, i.e., better performance on the pretraining task after fine-tuning.

References

  • [1] J. Shi, D. Berrebbi, W. Chen, E.-P. Hu, W.-P. Huang, H.-L. Chung, X. Chang, S.-W. Li, A. Mohamed, H. yi Lee, and S. Watanabe, “ML-SUPERB: Multilingual Speech Universal PERformance Benchmark,” in Proc. INTERSPEECH 2023, 2023, pp. 884–888.
  • [2] J. Zhao and W.-Q. Zhang, “Improving automatic speech recognition performance for low-resource languages with self-supervised models,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1227–1241, 2022.
  • [3] S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli, “Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?” in Proc. INTERSPEECH 2023, 2023, pp. 2873–2877.
  • [4] S. Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, Z. Huang, S. Dong, S.-W. Li, S. Watanabe, A. Mohamed, and H. yi Lee, “Superb: Speech processing universal performance benchmark,” 2021.
  • [5] S. Zaiem, Y. Kemiche, T. Parcollet, S. Essid, and M. Ravanelli, “Speech self-supervised representations benchmarking: a case for larger probing heads,” 2023.
  • [6] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu, “Recall and learn: Fine-tuning deep pretrained language models with less forgetting,” in EMNLP, Online, Nov. 2020, pp. 7870–7881.
  • [7] T. He, J. Liu, K. Cho, M. Ott, B. Liu, J. Glass, and F. Peng, “Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models,” in EACL, Apr. 2021.
  • [8] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song, “Using self-supervised learning can improve model robustness and uncertainty,” Advances in neural information processing systems, vol. 32, 2019.
  • [9] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual lifelong learning with neural networks: A review,” Neural Networks, vol. 113, pp. 54–71, 2019.
  • [10] L. D. Libera, P. Mousavi, S. Zaiem, C. Subakan, and M. Ravanelli, “Cl-masr: A continual learning benchmark for multilingual asr,” 2023.
  • [11] W. Hou, H. Zhu, Y. Wang, J. Wang, T. Qin, R. Xu, and T. Shinozaki, “Exploiting adapters for cross-lingual low-resource speech recognition,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, p. 317–329, dec 2022.
  • [12] V. A. Trinh, P. Ghahremani, B. King, J. Droppo, A. Stolcke, and R. Maas, “Reducing Geographic Disparities in Automatic Speech Recognition via Elastic Weight Consolidation,” in Proc. Interspeech 2022, 2022, pp. 1298–1302.
  • [13] S. Majumdar, S. Acharya, V. Lavrukhin, and B. Ginsburg, “Damage control during domain adaptation for transducer based automatic speech recognition,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 130–135.
  • [14] A. Diwan, C.-F. Yeh, W.-N. Hsu, P. Tomasello, E. Choi, D. Harwath, and A. Mohamed, “Continual learning for on-device speech recognition using disentangled conformers,” in ICASSP.   IEEE, jun 2023.
  • [15] J.-H. Lee, C.-W. Lee, J.-S. Choi, J.-H. Chang, W. K. Seong, and J. Lee, “CTRL: Continual Representation Learning to Transfer Information of Pre-trained for WAV2VEC 2.0,” in Proc. Interspeech 2022, 2022, pp. 3398–3402.
  • [16] S. Otake, R. Kawakami, and N. Inoue, “Parameter efficient transfer learning for various speech processing tasks,” in IEEE ICASSP, 2023, pp. 1–5.
  • [17] W. Ren, X. Li, L. Wang, T. Zhao, and W. Qin, “Analyzing and reducing catastrophic forgetting in parameter efficient tuning,” 2024.
  • [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in ICLR, 2022.
  • [19] R. Fan, Y. Zhu, J. Wang, and A. Alwan, “Towards better domain adaptation for self-supervised models: A case study of child asr,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1242–1252, 2022.
  • [20] T. Feng and S. Narayanan, “Peft-ser: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models,” 06 2023.
  • [21] Z.-C. Chen, C.-L. Fu, C.-Y. Liu, S.-W. D. Li, and H.-y. Lee, “Exploring efficient-tuning methods in self-supervised speech models,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 1120–1127.
  • [22] M. Ravanelli, T. Parcollet, A. Rouhe, P. Plantinga, E. Rastorgueva, L. Lugosch, N. Dawalatabad, C. Ju-Chieh, A. Heba, F. Grondin, W. Aris, C.-F. Liao, S. Cornell, S.-L. Yeh, H. Na, Y. Gao, S.-W. Fu, C. Subakan, R. De Mori, and Y. Bengio, “Speechbrain,” https://github.com/speechbrain/speechbrain, 2021.
  • [23] S. Zaiem, R. Algayres, T. Parcollet, S. Essid, and M. Ravanelli, “Fine-tuning strategies for faster inference using speech self-supervised models: A comparative study,” in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023, pp. 1–5.
  • [24] A. Kumar, A. Raghunathan, R. Jones, T. Ma, and P. Liang, “Fine-tuning can distort pretrained features and underperform out-of-distribution,” ICLR, 2022.
  • [25] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” PNAS, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [26] R. A. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London, A, vol. 222, pp. 309–368, 1922.
  • [27] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne, “Experience replay for continual learning,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [28] G. Chen, S. Chai, G.-B. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, M. **, S. Khudanpur, S. Watanabe, S. Zhao, W. Zou, X. Li, X. Yao, Y. Wang, Z. You, and Z. Yan, “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,” in Proc. Interspeech 2021, 2021, pp. 3670–3674.
  • [29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
  • [30] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, “Open-source Multi-speaker Corpora of the English Accents in the British Isles,” in LREC, May 2020, pp. 6532–6541.
  • [31] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in LREC, May 2020.
  • [32] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” in International Conference on Machine Learning.   PMLR, 2022, pp. 1298–1312.
  • [33] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised Cross-Lingual Representation Learning for Speech Recognition,” in Proc. Interspeech 2021, 2021, pp. 2426–2430.
  • [34] S. V. Eeckt and H. Van Hamme, “Using adapters to overcome catastrophic forgetting in end-to-end automatic speech recognition,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [35] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP.   IEEE, 2020, pp. 7669–7673.