Scaling Laws for Forgetting When Fine-Tuning Large Language Models

Abstract: We study and quantify the problem of forgetting when fine-tuning pre-trained large language models (LLMs) on a downstream task. We find that parameter-efficient fine-tuning (PEFT) strategies, such as Low-Rank Adapters (LoRA), still suffer from catastrophic forgetting. In particular, we identify a strong inverse linear relationship between the fine-tuning performance and the amount of forgetting when fine-tuning LLMs with LoRA. We further obtain precise scaling laws that show forgetting increases as a shifted power law in the number of parameters fine-tuned and the number of update steps. We also examine the impact of forgetting on knowledge, reasoning, and the safety guardrails trained into Llama 2 7B chat. Our study suggests that forgetting cannot be avoided through early stopping or by varying the number of parameters fine-tuned. We believe this opens up an important safety-critical direction for future research to evaluate and develop fine-tuning schemes which mitigate forgetting.

Submitted 10 January, 2024; originally announced January 2024.

ACM Class: I.2.7

arXiv:2312.03732

A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

Authors: Damjan Kalajdzievski

Abstract: As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.

Submitted 27 November, 2023; originally announced December 2023.

ACM Class: I.2.7
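
The rescaling described in the rsLoRA abstract above can be illustrated with a minimal sketch. It only compares the two multipliers the abstract mentions (division by the rank versus division by the square root of the rank). The function name `lora_scale` and the numerator `alpha` are assumptions introduced here for illustration; the abstract itself only says the factor is rank-dependent.

```python
import math


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Return the multiplier applied to an adapter's low-rank matrix product.

    `alpha` is a hypothetical constant numerator (an assumption, not taken
    from the abstract); `rank` is the adapter rank.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    if rank_stabilized:
        # rsLoRA: divide by the square root of the rank.
        return alpha / math.sqrt(rank)
    # Conventional LoRA scaling: divide by the rank itself.
    return alpha / rank


if __name__ == "__main__":
    # Compare how quickly each multiplier shrinks as the rank grows.
    for r in (4, 64, 1024):
        print(r, lora_scale(16.0, r), lora_scale(16.0, r, rank_stabilized=True))
```

Under these assumptions, the conventional multiplier shrinks in proportion to the rank while the rank-stabilized one shrinks only with its square root, which is consistent with the abstract's claim that the conventional factor slows learning at higher ranks.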

arXiv:2211.16607

Transfer Entropy Bottleneck: Learning Sequence to Sequence Information Transfer

Authors: Damjan Kalajdzievski, Ximeng Mao, Pascal Fortier-Poisson, Guillaume Lajoie, Blake Richards

Abstract: When presented with a data stream of two statistically dependent variables, predicting the future of one of the variables (the target stream) can benefit from information about both its history and the history of the other variable (the source stream). For example, fluctuations in temperature at a weather station can be predicted using both temperatures and barometric readings. However, a challenge when modelling such data is that it is easy for a neural network to rely on the greatest joint correlations within the target stream, which may ignore a crucial but small information transfer from the source to the target stream. As well, there are often situations where the target stream may have previously been modelled independently and it would be useful to use that model to inform a new joint model. Here, we develop an information bottleneck approach for conditional learning on two dependent streams of data. Our method, which we call Transfer Entropy Bottleneck (TEB), allows one to learn a model that bottlenecks the directed information transferred from the source variable to the target variable, while quantifying this information transfer within the model. As such, TEB provides a useful new information bottleneck approach for modelling two statistically dependent streams of data in order to make predictions about one of them.

Submitted 8 March, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: 41 pages, 26 figures

Journal ref: Transactions on Machine Learning Research (TMLR), 2023

arXiv:2012.02306

The ultrafilter and almost disjointness numbers

Authors: Osvaldo Guzman, Damjan Kalajdzievski

Abstract: We prove that every MAD family can be destroyed by a proper forcing that preserves $P$-points. With this result, we prove that it is consistent that $ω_{1}=\mathfrak{u}<\mathfrak{a}$, solving a nearly 20-year-old problem of Shelah and a problem of Brendle. We will also present a simple proof of a result of Blass and Shelah that the inequality $\mathfrak{u}<\mathfrak{s}$ is consistent.

Submitted 4 June, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

MSC Class: 03E17; 03E35

arXiv:1711.11148

Forcing and Construction Schemes

Authors: Damjan Kalajdzievski, Fulgencio Lopez

Abstract: We investigate forcing and independence questions relating to construction schemes. We show that adding $κ\geqω_1$ Cohen reals adds a capturing construction scheme. We study the weaker structure of $n$-capturing construction schemes and show that it is consistent to have $n$-capturing construction schemes but no $(n+1)$-capturing construction schemes. We also study the relation of $n$-capturing with the $m$-Knaster hierarchy and show that MA$_{ω_1}($K$_m)$ and $n$-capturing are independent if $n\leq m$ and incompatible if $n>m$.

Submitted 19 January, 2018; v1 submitted 29 November, 2017; originally announced November 2017.

Comments: 12 pages, submitted to Acta Math. Hung

MSC Class: 03E05; 03E35; 03E65

arXiv:1205.5819

Measurability Aspects of the Compactness Theorem for Sample Compression Schemes

Authors: Damjan Kalajdzievski

Abstract: It was proved in 1998 by Ben-David and Litman that a concept space has a sample compression scheme of size d if and only if every finite subspace has a sample compression scheme of size d. In the compactness theorem, measurability of the hypotheses of the created sample compression scheme is not guaranteed; at the same time, measurability of the hypotheses is a necessary condition for learnability. In this thesis we discuss when a sample compression scheme, created from compression schemes on finite subspaces via the compactness theorem, has measurable hypotheses. We show that if X is a standard Borel space with a d-maximum and universally separable concept class C, then (X,C) has a sample compression scheme of size d with universally Borel measurable hypotheses. Additionally, we introduce a new variant of compression scheme called a copy sample compression scheme.

Submitted 17 July, 2012; v1 submitted 25 May, 2012; originally announced May 2012.

Comments: LaTeX2e, 64 pages, 1 figure. This is an M.Sc. thesis defended on July 4th, 2012 at the University of Ottawa, Canada, under the supervision of Dr. V. Pestov, and with examiners Dr. J. Levy and Dr. S. Zilles

Showing 1–6 of 6 results for author: Kalajdzievski, D