License: CC BY 4.0
arXiv:2312.07028v1 [cs.CL] 12 Dec 2023

DCS: Dynamic Corrective Self-Distillation for Better Fine-Tuning of Pretrained Models

Ibtihel Amara¹, Vinija Jain²,³†, Aman Chadha²,³†
¹McGill University  ²Stanford University  ³Amazon AI
Abstract

We tackle the challenging issue of aggressive fine-tuning encountered during the process of transfer learning of pre-trained language models (PLMs) with limited labeled downstream data. This problem primarily results in a decline in performance on the subsequent task. Inspired by the adaptive boosting method in traditional machine learning, we present an effective dynamic corrective self-distillation (DCS) approach to improve the fine-tuning of the PLMs. Our technique involves performing a self-distillation mechanism where, at each iteration, the student model actively adapts and corrects itself by dynamically adjusting the weights assigned to individual data points. This iterative self-correcting process significantly enhances the overall fine-tuning capability of PLMs, leading to improved performance and robustness. We conducted comprehensive evaluations using the GLUE benchmark demonstrating the efficacy of our method in enhancing the fine-tuning process for various PLMs across diverse downstream tasks.

† Work does not relate to position at Amazon.

1 Introduction

Natural Language Processing (NLP) has advanced remarkably in the past few years thanks to pre-trained language models (PLMs). Recent PLMs such as BERT Devlin et al. (2019), RoBERTa Zhuang et al. (2021), and XLNet Yang et al. (2019) have reshaped the field by demonstrating significant progress across downstream applications such as machine translation, reading comprehension, and question answering. Standard practice is to fine-tune PLMs directly on labeled data from these tasks. Yet when downstream data is limited, this fine-tuning becomes aggressive Jiang et al. (2020): the model tends to overfit and its generalization capacity drops. Addressing this challenge has spurred various approaches, including hyper-parameter tuning heuristics Howard and Ruder (2018); Peters et al. (2019), additional layer integration Houlsby et al. (2019); Stickland and Murray (2019), improved training strategies Chen et al. (2020), and noise-induced fine-tuning methods Wu et al. (2022). Notably, techniques such as adapters, hypernetworks, LoRA, QLoRA, and GLoRA have gained traction for efficient parameter updates when data is scarce Houlsby et al. (2019); Hu et al. (2021); Dettmers et al. (2023); Chavan et al. (2023).

Figure 1: Dynamic Corrective Self-Distillation Framework. DCS iteratively adjusts the weights on the data samples at every epoch. It puts more emphasis on teacher-student discordant (i.e., non-agreeing) samples. We refer to agreement between teacher and student as the level of consensus in their predictions. The distillation component ensures the proper guidance of the student network throughout its optimization process.

Surprisingly, the potential of distillation as a fine-tuning tool has been somewhat overlooked. Despite its simplicity, distillation offers substantial utility in guiding and optimizing fine-tuning. In this work, we introduce Dynamic Corrective Self-Distillation (DCS), a straightforward approach inspired by adaptive boosting in machine learning. DCS iteratively adjusts sample weights, prioritizing instances where the teacher and student models disagree. This emphasis on challenging examples helps the student improve its accuracy in subsequent iterations. Our contributions are as follows:

  • We propose a flexible and adaptable fine-tuning framework called DCS, based on a fusion of distillation and boosting, which helps large PLMs avoid overfitting.

  • We introduce a self-corrective training framework where the student model is guided by the knowledge of a teacher network.

  • Our technique improves the average GLUE score by up to 2.5 points over vanilla fine-tuning (ELECTRA, Table 1).

| Model | MNLI | QQP | RTE | QNLI | MRPC | CoLA | SST | STS | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| BERT (base) Devlin et al. (2019) | 84.4 | 90.9 | 67.7 | 91.5 | 87.1 | 58.1 | 93.0 | 89.4 | 82.76 |
| BERT (base) + DCS | **84.8** | **91.2** | **71.8** | **91.8** | **87.7** | **59.7** | **93.2** | **89.5** | **83.71** |
| RoBERTa (base) Wu et al. (2022) | 87.5 | 91.7 | 77.1 | 92.7 | 90.1 | 62.9 | **94.5** | 90.8 | 85.91 |
| RoBERTa (base) + DCS | **87.6** | **91.8** | **81.2** | **93.0** | **90.7** | **63.4** | 94.2 | **90.9** | **86.60** |
| XLNET Yang et al. (2019) | 86.6 | 91.2 | 72.9 | 91.6 | 88.1 | 59.6 | 94.4 | **89.6** | 84.25 |
| XLNET + DCS | **86.7** | **91.5** | **74.4** | **91.7** | **89.2** | **60.5** | **95.1** | 88.5 | **84.70** |
| ELECTRA Clark et al. (2020) | 88.4 | 91.7 | 75.2 | 92.9 | 88.2 | 64.2 | 94.9 | 90.1 | 85.70 |
| ELECTRA + DCS | **88.7** | **92.0** | **84.5** | **93.5** | **90.4** | **70.4** | **95.2** | **91.1** | **88.22** |

Table 1: Comparison between DCS and vanilla fine-tuning applied to widely used pretrained language models. The better result in each model pair is in bold. DCS improves on vanilla fine-tuning on nearly all tasks and raises the average score of every PLM; the exceptions are SST with RoBERTa and STS with XLNET. All DCS values are means over 3 random seeds.

| Methods | CoLA | RTE | MRPC | STS-B | Avg | Δ |
|---|---|---|---|---|---|---|
| Vanilla FT | 63.13 | 70.18 | 90.77 | 89.61 | 78.42 | 0.00 |
| Weight Decay | 63.63 | 71.99 | 90.93 | 89.82 | 79.09 | +0.67 |
| Top-K FT | 62.63 | 70.90 | 91.09 | 89.97 | 78.65 | +0.23 |
| Mixout | 63.60 | 72.12 | 91.29 | 89.99 | 79.26 | +0.84 |
| RecAdam | 64.33 | 71.63 | 90.85 | 89.86 | 79.17 | +0.75 |
| R3F | 64.13 | 72.28 | 91.18 | 89.61 | 79.30 | +0.88 |
| CHILD FT(F) | 63.71 | 72.02 | 91.22 | 90.18 | 79.29 | +0.87 |
| CHILD FT(D) | 64.92 | 73.14 | 91.42 | 90.18 | 79.92 | +1.50 |
| DCS (Ours) | 64.38 | 74.36 | 91.58 | 89.95 | 80.06 | +1.65 |

Table 2: Comparison of DCS with other existing fine-tuning methods Daumé III (2007); Houlsby et al. (2019); Lee et al. (2020); Chen et al. (2020); Aghajanyan et al. (2021); Xu et al. (2021). Most values are taken from Xu et al. (2021). Δ is the gain in average score over vanilla fine-tuning (Vanilla FT). Our findings show that DCS performs comparably to or better than other fine-tuning methods.

2 Dynamic Corrective Self-Distillation

To address the issue of limited data for fine-tuning, we draw inspiration from adaptive boosting in machine learning and propose a dynamic process that adjusts the sample weights at each epoch of training. To adjust the weights accurately, we rely on knowledge distillation (KD) Hinton et al. (2015), in which a teacher network guides a student model, here with the same architecture and capacity. Figure 1 gives a visual representation of the DCS framework. Initially, DCS assigns equal weights to every data point in the downstream dataset. As learning progresses, DCS modifies these weights: greater emphasis is placed on samples where the student and teacher networks diverge in their predictions. In other words, samples on which the student's prediction differs from the teacher's are assigned higher weights. With this mechanism, DCS leverages the discrepancy between the student and teacher networks to enhance learning. KD training is performed according to the following objective function.

$$L_{Total} = \alpha L_{CE} + (1-\alpha) L_{KD} \tag{1}$$

$$L_{KD} = -\sum_{i=1}^{N} p(x_i) \log(q(x_i)) \tag{2}$$

where $p(x_i)$ and $q(x_i)$ are the class probabilities generated by the teacher and the student, respectively, and $L_{CE}$ is the cross-entropy loss with respect to the ground-truth labels. For the DCS technique, each sample weight is adjusted according to the agreement between teacher and student. To this end, $L_{KD}$ is replaced with the following:

$$L_{KD} = -\sum_{i=1}^{N} w_i \, p(x_i) \log(q(x_i)) \tag{3}$$

such that $w_i = \lambda$ if $\hat{y}_i^{T} \neq \hat{y}_i^{S}$, and $w_i = 1$ otherwise. $\lambda$ is a hyper-parameter greater than 1 ($\lambda > 1$). $\hat{y}_i^{T}$ and $\hat{y}_i^{S}$ are the predictions of the teacher and the student, respectively, for a sample point $x_i$. The steps for performing DCS are given in Algorithm 1 (Appendix A.1).
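
The objective in Eqs. 1 to 3 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, the inner sum over classes is our reading of the compact notation $p(x_i)\log(q(x_i))$, and the default values of `alpha` and `lam` are placeholders.

```python
import math

def argmax(values):
    """Index of the largest entry, used as the hard prediction."""
    return max(range(len(values)), key=values.__getitem__)

def dcs_weights(teacher_probs, student_probs, lam=2.0):
    """w_i = lam where teacher and student predict different classes, else 1."""
    return [lam if argmax(p) != argmax(q) else 1.0
            for p, q in zip(teacher_probs, student_probs)]

def kd_loss(teacher_probs, student_probs, weights=None):
    """Eq. 2 when weights is None, Eq. 3 otherwise.

    Assumption: p(x_i) log q(x_i) is summed over classes for each sample.
    """
    if weights is None:
        weights = [1.0] * len(teacher_probs)
    return -sum(w * sum(pc * math.log(qc) for pc, qc in zip(p, q))
                for w, p, q in zip(weights, teacher_probs, student_probs))

def ce_loss(labels, student_probs):
    """Cross-entropy of the student against the hard (ground-truth) labels."""
    return -sum(math.log(q[y]) for y, q in zip(labels, student_probs))

def total_loss(labels, teacher_probs, student_probs, alpha=0.5, lam=2.0):
    """Eq. 1 with the weighted KD term of Eq. 3."""
    weights = dcs_weights(teacher_probs, student_probs, lam)
    return (alpha * ce_loss(labels, student_probs)
            + (1 - alpha) * kd_loss(teacher_probs, student_probs, weights))

# Toy batch: the two networks agree on sample 0 and disagree on sample 1.
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[0.7, 0.3], [0.6, 0.4]]
print(dcs_weights(teacher, student))  # [1.0, 2.0]
```

Only the discordant sample has its distillation term scaled by `lam`; the cross-entropy term is left unweighted, as in Eq. 1.
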
How can self-KD help fine-tune the student? Recent studies by Furlanello et al. (2018) and Stanton et al. (2021) have highlighted the potential of self-distillation, where the students created through this process can surpass the performance of their teachers. This outcome stems from the fact that distillation inherently involves simplifying the student's model based on the teacher's knowledge. The crucial point is that distillation does not need an exact match between the student and the teacher. In fact, striving for a perfect match would hinder the student's ability to learn and outperform the teacher. These works study the significance of not merely mimicking the teacher but rather distilling and imparting vital knowledge in a manner that amplifies the student's performance.
Weighting discordant samples. Sample weighting has been shown to enhance the KD process Lu et al. (2021). More specifically, DCS assigns increased weights to the discordant samples, amplifying their impact in the subsequent stages of student training. This steers the student model toward challenging samples by leveraging the teacher's soft labels while still following the guidance of the hard labels (i.e., the ground-truth labels).

3 Experiments and Results

3.1 Experimental Setup

We conduct our experiments on various datasets from the GLUE benchmark Wang et al. (2018), a collection of datasets covering tasks such as natural language inference, sentiment analysis, and sentence similarity. Following previous work, we fine-tune our pretrained models on the training set and report results directly on the dev set. We use the pretrained models and code provided by HuggingFace Wolf et al. (2020). Appendix A.3 elaborates on our hyperparameter sweep setup.

3.2 Results

Comparison to Vanilla Fine-Tuning. We present results of DCS on eight GLUE tasks using four commonly used PLMs. Following the evaluation methodology of related works Xu et al. (2021); Wu et al. (2022), we compare the performance of DCS against vanilla fine-tuning. Table 1 shows that DCS outperforms vanilla fine-tuning on nearly all tasks (the exceptions are SST with RoBERTa and STS with XLNET) and improves the average score of every PLM. On average, DCS achieves an improvement of approximately 1 point across the GLUE tasks with the BERT base model, and a 2.5-point increase with ELECTRA. DCS is particularly effective on smaller downstream tasks. RTE, the smallest of the eight tasks we evaluate with only 2.5K samples, benefits the most: Table 1 shows a gain of more than 9 points with ELECTRA and approximately 4 points with RoBERTa and BERT.
Comparison to Existing FT Methods. We next compare DCS with existing fine-tuning techniques. Following Xu et al. (2021), we run the comparison on the BERT-large model and report mean values across different seeds. Results are shown in Table 2. All fine-tuning techniques improve upon vanilla fine-tuning. DCS, although simpler than the existing methods, matches or outperforms them: it reaches the highest average score (80.06) and the best results on RTE and MRPC. As in Table 1, the largest gain of DCS is on RTE, the smallest dataset in our evaluation.
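
The gains discussed above can be recomputed directly from Table 1. The short script below only re-derives numbers already in the table; the helper names are ours.

```python
# Scores copied from Table 1; each entry is (vanilla fine-tuning, with DCS).
TASKS = ["MNLI", "QQP", "RTE", "QNLI", "MRPC", "CoLA", "SST", "STS"]
TABLE1 = {
    "BERT (base)": ([84.4, 90.9, 67.7, 91.5, 87.1, 58.1, 93.0, 89.4],
                    [84.8, 91.2, 71.8, 91.8, 87.7, 59.7, 93.2, 89.5]),
    "RoBERTa (base)": ([87.5, 91.7, 77.1, 92.7, 90.1, 62.9, 94.5, 90.8],
                       [87.6, 91.8, 81.2, 93.0, 90.7, 63.4, 94.2, 90.9]),
    "XLNET": ([86.6, 91.2, 72.9, 91.6, 88.1, 59.6, 94.4, 89.6],
              [86.7, 91.5, 74.4, 91.7, 89.2, 60.5, 95.1, 88.5]),
    "ELECTRA": ([88.4, 91.7, 75.2, 92.9, 88.2, 64.2, 94.9, 90.1],
                [88.7, 92.0, 84.5, 93.5, 90.4, 70.4, 95.2, 91.1]),
}

def mean(values):
    return sum(values) / len(values)

def average_gain(model):
    """Difference between the average DCS score and the average vanilla score."""
    vanilla, dcs = TABLE1[model]
    return mean(dcs) - mean(vanilla)

def task_gain(model, task):
    """DCS score minus vanilla score on a single task."""
    vanilla, dcs = TABLE1[model]
    i = TASKS.index(task)
    return dcs[i] - vanilla[i]

def regressions(model):
    """Tasks on which the DCS score is lower than the vanilla score."""
    vanilla, dcs = TABLE1[model]
    return [t for t, v, d in zip(TASKS, vanilla, dcs) if d < v]

for model in TABLE1:
    print(f"{model}: avg {average_gain(model):+.2f}, "
          f"RTE {task_gain(model, 'RTE'):+.1f}, lower on {regressions(model)}")
```
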

Figure 2: Effect of different weighting strategies of DCS on the performance. Results show that leveraging the discordant teacher-student samples benefits our DCS fine-tuning framework.

3.3 Ablation and Analyses

| Model and Technique | RTE | CoLA | MRPC |
|---|---|---|---|
| BERT DCS | 71.11 | 59.61 | 87.58 |
| BERT DCS (w/o weighting) | 70.39 | 58.38 | 87.25 |
| ELECTRA DCS | 84.47 | 70.57 | 90.49 |
| ELECTRA DCS (w/o weighting) | 83.75 | 70.06 | 90.19 |

Table 3: Ablation study of DCS. Without weighting involves distillation without the dynamic self-correction component. Removing both weighting and distillation components leads to vanilla fine-tuning. Results emphasize the importance of combining both components for enhancing the performance of our fine-tuning framework.

Components of DCS. The DCS technique comprises two key components: (1) the sample re-weighting mechanism, which serves as the self-corrective element of DCS, and (2) the offline distillation component, which plays a dual role. First, it identifies samples that require increased attention; second, it continues to guide the student network during learning. To assess the significance of these components, we conducted an ablation study, whose results are presented in Table 3. Removing the self-corrective mechanism (DCS without weighting) leaves plain distillation, which lowers performance on every task in Table 3. On RTE, the score drops by 0.72 points for both BERT base (from 71.11 to 70.39) and ELECTRA (from 84.47 to 83.75). Further removing the distillation component returns us to vanilla fine-tuning, which DCS outperforms on average (Table 1).
Leveraging Disagreement. In this section, we investigate the benefits of assigning higher weights to discordant student-teacher samples in DCS and how different weighting strategies affect its performance. Figure 2 compares DCS with two alternative strategies: DCS-reverse, which prioritizes concordant teacher-student samples, and DCS-random, which assigns weights randomly. Analyzing the performance of BERT-base on different datasets, we observe that leveraging the discrepancies between teacher and student networks improves the performance and robustness of DCS. DCS-reverse performs worst, suggesting that focusing on concordant samples hampers generalization. Conversely, emphasizing samples on which teacher and student disagree enhances the student's understanding and its potential for generalization, resulting in performance gains.
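
The three weighting strategies compared in Figure 2 can be written as one small function. This is a sketch under assumptions: the exact scheme behind DCS-random is not specified in the text, so here each sample is up-weighted with probability 0.5, and the function and argument names are hypothetical.

```python
import random

def strategy_weights(teacher_preds, student_preds, strategy="dcs",
                     lam=2.0, rng=None):
    """Per-sample weights for the DCS, DCS-reverse and DCS-random strategies."""
    rng = rng or random.Random(0)
    weights = []
    for t, s in zip(teacher_preds, student_preds):
        disagree = t != s
        if strategy == "dcs":          # emphasize discordant samples
            upweight = disagree
        elif strategy == "reverse":    # emphasize concordant samples
            upweight = not disagree
        elif strategy == "random":     # assumption: coin flip per sample
            upweight = rng.random() < 0.5
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        weights.append(lam if upweight else 1.0)
    return weights

teacher_preds = [0, 1, 1]
student_preds = [0, 0, 1]
print(strategy_weights(teacher_preds, student_preds, "dcs"))      # [1.0, 2.0, 1.0]
print(strategy_weights(teacher_preds, student_preds, "reverse"))  # [2.0, 1.0, 2.0]
```
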

3.4 Hyperparameter Sensitivity

Influence of the hyperparameter $\alpha$. In KD, $\alpha$ controls the relative contribution of the two losses (i.e., cross-entropy and KD). We run experiments with different values of $\alpha$. As shown in Figure 3, DCS performs best for small to intermediate values of $\alpha$, which suggests that the fine-tuned student benefits from the distillation component of the optimization process. Relevant details can be found in Appendix A.2.

Figure 3: Sensitivity of DCS to $\alpha$. The results highlight the importance of teacher guidance through distillation. By Eq. 1, higher $\alpha$ values give more weight to the cross-entropy loss, while lower values emphasize the distillation loss. Moderate $\alpha$ values lead to significant performance improvements, while further increases in $\alpha$ reduce performance.
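
As a quick check of how $\alpha$ shifts the balance in Eq. 1, consider the two extremes of the search grid in Appendix A.3 (chosen here purely for illustration):

```latex
\alpha = 0.1:\quad L_{Total} = 0.1\,L_{CE} + 0.9\,L_{KD}
\qquad\qquad
\alpha = 0.9:\quad L_{Total} = 0.9\,L_{CE} + 0.1\,L_{KD}
```

With $\alpha = 0.1$ the teacher's soft labels dominate the objective, whereas with $\alpha = 0.9$ the student is trained almost entirely on the hard labels.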

4 Conclusion

In this paper, we introduce DCS, a framework for enhancing the fine-tuning of pre-trained language models (PLMs) on downstream tasks. By incorporating a dynamic self-correction mechanism triggered via self-distillation, DCS achieves remarkable performance gains over vanilla fine-tuning and previous approaches. Through extensive experiments on various tasks from the GLUE benchmark, we demonstrate the superior performance of DCS across four different pre-trained language models. Notably, DCS not only outperforms existing methods but also significantly enhances the generalization ability of the fine-tuned models. One of the main advantages of using our proposed method DCS is flexibility. Unlike other existing methods such as parameter-efficient fine-tuning (PEFT) methods and noise addition, DCS does not limit the flexibility of the model and its ability to adapt to each downstream task.

5 Limitations

One main limitation of the DCS method is its computational cost. DCS involves training a teacher model (of similar size to the student network) with direct fine-tuning on the downstream task. We then use this teacher to guide a student model through the adaptive corrective approach via knowledge distillation. However, vanilla fine-tuned teacher models are abundant: a directly fine-tuned teacher network can often be found in public repositories such as the Hugging Face Hub. In such cases, the extra cost of training the teacher can be avoided.

References

  • Aghajanyan et al. (2021) Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal Gupta. 2021. Better fine-tuning by reducing representational collapse. In International Conference on Learning Representations.
  • Chavan et al. (2023) Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. 2023. One-for-all: Generalized LoRA for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967.
  • Chen et al. (2020) Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. 2020. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7870–7881, Online. Association for Computational Linguistics.
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  • Daumé III (2007) Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263, Prague, Czech Republic. Association for Computational Linguistics.
  • Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Furlanello et al. (2018) Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Jiang et al. (2020) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: Robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2177–2190, Online. Association for Computational Linguistics.
  • Lee et al. (2020) Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Lu et al. (2021) Peng Lu, Abbas Ghaddar, Ahmad Rashid, Mehdi Rezagholizadeh, Ali Ghodsi, and Philippe Langlais. 2021. RW-KD: Sample-wise loss terms re-weighting for knowledge distillation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3145–3152.
  • Peters et al. (2019) Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14, Florence, Italy. Association for Computational Linguistics.
  • Stanton et al. (2021) Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. 2021. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34:6906–6919.
  • Stickland and Murray (2019) Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In International Conference on Machine Learning, pages 5986–5995. PMLR.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu et al. (2022) Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2022. NoisyTune: A little noise can help you finetune pretrained language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 680–685, Dublin, Ireland. Association for Computational Linguistics.
  • Xu et al. (2021) Runxin Xu, Fuli Luo, Zhiyuan Zhang, Chuanqi Tan, Baobao Chang, Songfang Huang, and Fei Huang. 2021. Raise a child in large language model: Towards effective and generalizable fine-tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9514–9528, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32.
  • Zhuang et al. (2021) Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218–1227, Huhhot, China. Chinese Information Processing Society of China.

Frequently Asked Questions (FAQs)

  • How do you determine the optimal teacher network to guide the student model effectively during the process of distillation?

    The DCS framework does not require a specific type of teacher model or high-performing teacher model. Any pre-fine-tuned teacher model can be used within the DCS framework. In our experiments, we opted for self-distillation, where the teacher and student networks have similar architectures and sizes. The performance of the teacher model does not need to be exceptional or achieve high accuracy. It simply needs to be trained for a few iterations without full convergence. In our experiments, we fine-tuned the teachers for a limited number of epochs (2 epochs). This flexibility in the choice of teacher models allows for broader applicability and ease of implementation within the DCS framework.

  • How can we expect the student network to achieve better results on downstream tasks if we choose a teacher network that is not performing well?

    Knowledge distillation can still benefit the student network's optimization process. Its main purpose is to transfer valuable insights learned by the teacher to the student Furlanello et al. (2018); Stanton et al. (2021). Although the teacher network may not achieve high performance on its own, it still captures valuable patterns and relationships in the data. By using this distilled knowledge, the student can improve its performance and achieve better results on the downstream task. Note also that in our DCS framework we perform a weighted distillation that gives higher weights to discordant samples. A discordant sample falls into one of two scenarios: (a) the teacher network classified the sample correctly whereas the student misclassified it, or (b) vice versa. The student attends to both scenarios and learns from both the good and the bad throughout training, which contributes to its effectiveness and robustness. Furthermore, our student networks receive guidance not only from the soft labels provided by the teacher but also from the hard (ground-truth) labels. This dual guidance, together with the re-weighting strategy, serves as an effective regularization strategy.

  • What type of knowledge are you using for distilling information from teacher to student?

    In DCS, we use only response-based knowledge. We exploit the logits (i.e., the raw pre-softmax outputs) to align the knowledge between the teacher and the student networks. Our main goal is to propose a straightforward fine-tuning framework. We therefore opted for this type of knowledge rather than feature-based distillation, which would add computational complexity to the overall framework. In addition, by aligning the logits, the student model can gain more insight into the finer details of the task, guided by the teacher model.

  • DCS is inspired by the adaptive boosting technique in machine learning, which in the end combines multiple weak learners. Do you do something similar?

    In the current version of our work, the same network is adjusted (i.e., dynamic self-correction) according to the newly weighted samples. For inference, we use a single network (i.e., the student network). It is possible to combine multiple versions of our student (i.e., different student model checkpoints) during inference to obtain more robust predictions.
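
The checkpoint combination mentioned in the last answer is not evaluated in this work. Purely as an illustration of what it could look like, the sketch below averages the class probabilities produced by several student checkpoints for a single input; the averaging rule and the function names are assumptions, not part of DCS.

```python
def average_probs(checkpoint_probs):
    """Average the class probabilities that several checkpoints give one input."""
    n = len(checkpoint_probs)
    num_classes = len(checkpoint_probs[0])
    return [sum(p[c] for p in checkpoint_probs) / n for c in range(num_classes)]

def ensemble_predict(checkpoint_probs):
    """Hard prediction of the averaged ensemble (assumed combination rule)."""
    avg = average_probs(checkpoint_probs)
    return max(range(len(avg)), key=avg.__getitem__)

# Three hypothetical checkpoints scoring the same binary-classification input.
probs = [[0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]
print(ensemble_predict(probs))  # 1
```
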

Appendix A Appendix

This section provides supplementary material in the form of additional results, implementation details, etc. to bolster the reader’s understanding of the concepts presented in this work.

A.1 DCS Algorithm

Algorithm 1 DCS
Load teacher weights $\theta_T$
Initialize student's parameters $\theta_S$
for each epoch do
    if epoch == 0 then
        Train the student according to Eq. 1 using the KD loss in Eq. 2
    else
        Re-weight the samples according to the agreement between the teacher and the current state of the student
        Train the student according to Eq. 1 using the KD loss in Eq. 3
    end if
end for

The pseudo-code for our DCS framework is outlined in Algorithm 1. A key requirement is a teacher model, already fine-tuned on the downstream task, with similar architecture and size to the student network. While the algorithm shares similarities with offline knowledge distillation for downstream tasks, the key differentiating factor is the re-weighting mechanism. It identifies discordant teacher-student samples and assigns them higher weights, allowing the student network to benefit from the insights these samples provide. Note that the first epoch of training in DCS intentionally excludes the re-weighting mechanism. This is based on the assumption that the student network has not yet encountered and learned the intricacies of the downstream data. Using this initial epoch as a warm-up phase lets the student network adapt gradually to the downstream task, leading to improved performance and generalization.
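
To make the control flow of Algorithm 1 concrete, here is a toy, dependency-free sketch. It replaces the PLM with a binary logistic-regression student and uses plain SGD, so it only illustrates the warm-up epoch and the per-epoch re-weighting; the model, data, learning rate and function names are all assumptions made for this example.

```python
import math

def predict_proba(theta, x):
    """P(class 1) under a logistic model; the last entry of theta is the bias."""
    z = sum(t * xi for t, xi in zip(theta, x)) + theta[-1]
    return 1.0 / (1.0 + math.exp(-z))

def dcs_train(data, labels, teacher_theta, epochs=5, alpha=0.5, lam=2.0, lr=0.1):
    # Load the (frozen) teacher and cache its soft labels.
    teacher_p = [predict_proba(teacher_theta, x) for x in data]
    # Initialize the student with the same architecture as the teacher.
    student = [0.0] * len(teacher_theta)
    weights = [1.0] * len(data)
    history = []
    for epoch in range(epochs):
        if epoch > 0:
            # Re-weight: lam where the hard predictions of the teacher and
            # of the current student differ, 1 otherwise.
            weights = [
                lam if (p >= 0.5) != (predict_proba(student, x) >= 0.5) else 1.0
                for p, x in zip(teacher_p, data)
            ]
        for x, y, p, w in zip(data, labels, teacher_p, weights):
            q = predict_proba(student, x)
            # Gradient w.r.t. the logit of alpha * CE + (1 - alpha) * w * KD.
            grad = alpha * (q - y) + (1 - alpha) * w * (q - p)
            for j, xj in enumerate(x):
                student[j] -= lr * grad * xj
            student[-1] -= lr * grad
        history.append(list(weights))
    return student, history

# Toy one-dimensional task with a teacher that separates the classes at x = 0.
data = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
student, history = dcs_train(data, labels, teacher_theta=[1.5, 0.0])
print(history[0])  # epoch 0 is the warm-up: [1.0, 1.0, 1.0, 1.0]
```

In a real setting the inner loop would be a mini-batch update of the PLM, but the order of operations (cache the teacher's outputs, warm up for one epoch, then re-weight before every later epoch) would stay the same.
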

A.2 Influence of the $\lambda$ Hyperparameter

| Model ↓ / $\lambda$ → | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| BERT-DCS | 70.39 | 71.11 | 70.50 | 70.81 | 70.63 |

Table 4: Influence of the hyper-parameter $\lambda$ on the performance of DCS on the RTE task.

DCS relies on the re-weighting strategy for self-correction. We perform experiments with different $\lambda$ values and assess their effect on the performance of DCS. Table 4 shows that the optimal value is $\lambda = 2$ (71.11 on RTE). For larger values, performance stays within a narrow range (70.50 to 70.81), still above the unweighted setting $\lambda = 1$ (70.39).

A.3 Hyperparameter Search Values

In this paper, we fine-tune different large pre-trained language models. We performed a simple grid search to find the optimal hyperparameter settings as shown in Table 5.

| Hyperparameters | Search Values |
|---|---|
| Batch size | {8, 16} |
| Learning rate (lr) | {1e-5, 2e-5, 1e-6, 2e-6} |
| Distillation temperature (T) | 1 |
| $\alpha$ (KD) | {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} |
| Sample weighting ($\lambda$) | {2, 3, 4, 5, 6} |

Table 5: Hyperparameter search space for all tasks in GLUE.
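
For a sense of the size of this search space, the grid in Table 5 can be enumerated as below. The dictionary keys are names chosen for this sketch, and whether every combination was actually run is not stated in the paper.

```python
from itertools import product

# Search values copied from Table 5; the key names are illustrative.
SEARCH_SPACE = {
    "batch_size": [8, 16],
    "learning_rate": [1e-5, 2e-5, 1e-6, 2e-6],
    "temperature": [1],
    "alpha": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "lam": [2, 3, 4, 5, 6],
}

def grid(space):
    """Yield one configuration dict per point of the Cartesian product."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(SEARCH_SPACE))
print(len(configs))  # 2 * 4 * 1 * 9 * 5 = 360
```
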