-
Scaling Laws for Forgetting When Fine-Tuning Large Language Models
Authors:
Damjan Kalajdzievski
Abstract:
We study and quantify the problem of forgetting when fine-tuning pre-trained large language models (LLMs) on a downstream task. We find that parameter-efficient fine-tuning (PEFT) strategies, such as Low-Rank Adapters (LoRA), still suffer from catastrophic forgetting. In particular, we identify a strong inverse linear relationship between the fine-tuning performance and the amount of forgetting when fine-tuning LLMs with LoRA. We further obtain precise scaling laws that show forgetting increases as a shifted power law in the number of parameters fine-tuned and the number of update steps. We also examine the impact of forgetting on knowledge, reasoning, and the safety guardrails trained into Llama 2 7B chat. Our study suggests that forgetting cannot be avoided through early stopping or by varying the number of parameters fine-tuned. We believe this opens up an important safety-critical direction for future research to evaluate and develop fine-tuning schemes which mitigate forgetting.
Submitted 10 January, 2024;
originally announced January 2024.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
Authors:
Damjan Kalajdzievski
Abstract:
As large language models (LLMs) have become increasingly compute and memory intensive, parameter-efficient fine-tuning (PEFT) methods are now a common strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), which adds trainable low-rank "adapters" to selected layers. Each adapter consists of a low-rank matrix product, multiplicatively scaled by a rank-dependent factor. This scaling factor, which divides adapters by a factor of the rank, results in slowed learning and stunted performance for LoRA with higher-rank adapters. Consequently, the use of LoRA in practice has generally been limited to very low ranks. In this work, we study the impact of the scaling factor on the learning process and prove that LoRA adapters should be divided by a factor of the square root of the rank. Modifying LoRA with the appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) method, easily provides for a fine-tuning compute/performance trade-off, where larger ranks can be used to trade off increased computational resources during training for better fine-tuning performance, with no change in inference computing cost.
Submitted 27 November, 2023;
originally announced December 2023.
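The scaling change described in the abstract above can be illustrated with a minimal sketch. It is not the author's implementation: the function names `lora_scale` and `adapted_forward`, the hyperparameter `alpha`, and the matrix shapes are assumptions chosen for illustration. The only point taken from the abstract is that standard LoRA divides the adapter by the rank, while rsLoRA divides it by the square root of the rank.

```python
import math
import numpy as np


def lora_scale(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank product B @ A.

    `alpha` is an assumed hyperparameter name, not taken from the abstract.
    Standard LoRA divides by the rank; rsLoRA divides by sqrt(rank).
    """
    if rank_stabilized:
        return alpha / math.sqrt(rank)
    return alpha / rank


def adapted_forward(x, W, A, B, alpha, rank_stabilized=False):
    """Frozen linear layer plus a scaled low-rank adapter (hypothetical helper).

    Assumed shapes: x (n, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r).
    """
    rank = A.shape[0]
    scale = lora_scale(alpha, rank, rank_stabilized)
    # x @ W.T is the frozen path; x @ A.T @ B.T is the adapter path.
    return x @ W.T + scale * (x @ A.T @ B.T)
```

Under these assumptions the two multipliers differ by a factor of the square root of the rank, so they coincide at rank 1 and diverge as the rank grows.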
-
Transfer Entropy Bottleneck: Learning Sequence to Sequence Information Transfer
Authors:
Damjan Kalajdzievski,
Ximeng Mao,
Pascal Fortier-Poisson,
Guillaume Lajoie,
Blake Richards
Abstract:
When presented with a data stream of two statistically dependent variables, predicting the future of one of the variables (the target stream) can benefit from information about both its history and the history of the other variable (the source stream). For example, fluctuations in temperature at a weather station can be predicted using both temperatures and barometric readings. However, a challenge when modelling such data is that it is easy for a neural network to rely on the greatest joint correlations within the target stream, which may ignore a crucial but small information transfer from the source to the target stream. As well, there are often situations where the target stream may have previously been modelled independently and it would be useful to use that model to inform a new joint model. Here, we develop an information bottleneck approach for conditional learning on two dependent streams of data. Our method, which we call Transfer Entropy Bottleneck (TEB), allows one to learn a model that bottlenecks the directed information transferred from the source variable to the target variable, while quantifying this information transfer within the model. As such, TEB provides a useful new information bottleneck approach for modelling two statistically dependent streams of data in order to make predictions about one of them.
Submitted 8 March, 2023; v1 submitted 29 November, 2022;
originally announced November 2022.
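For orientation, the quantity named in the TEB abstract above is usually defined as a conditional mutual information. The block below gives the standard textbook definition of transfer entropy, not necessarily the exact objective used in the paper; the symbols $X$ (source stream), $Y$ (target stream), and the subscript $<t$ (history before time $t$) are notation assumed here.

```latex
% Standard transfer entropy from source X to target Y (assumed notation).
\[
  T_{X \to Y}
  = I\bigl(Y_t ;\, X_{<t} \,\big|\, Y_{<t}\bigr)
  = H\bigl(Y_t \,\big|\, Y_{<t}\bigr) - H\bigl(Y_t \,\big|\, Y_{<t},\, X_{<t}\bigr)
\]
```

Read this way, the transfer entropy measures how much the source history reduces uncertainty about the next target value beyond what the target's own history already explains.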
-
The ultrafilter and almost disjointness numbers
Authors:
Osvaldo Guzman,
Damjan Kalajdzievski
Abstract:
We prove that every MAD family can be destroyed by a proper forcing that preserves $P$-points. With this result, we prove that it is consistent that $\omega_{1}=\mathfrak{u}<\mathfrak{a}$, solving a nearly 20-year-old problem of Shelah and a problem of Brendle. We will also present a simple proof of a result of Blass and Shelah that the inequality $\mathfrak{u}<\mathfrak{s}$ is consistent.
Submitted 4 June, 2021; v1 submitted 3 December, 2020;
originally announced December 2020.
-
Forcing and Construction Schemes
Authors:
Damjan Kalajdzievski,
Fulgencio Lopez
Abstract:
We investigate forcing and independence questions relating to construction schemes. We show that adding $\kappa\geq\omega_1$ Cohen reals adds a capturing construction scheme. We study the weaker structure of $n$-capturing construction schemes and show that it is consistent to have $n$-capturing construction schemes but no $(n+1)$-capturing construction schemes. We also study the relation of $n$-capturing with the $m$-Knaster hierarchy and show that $\mathrm{MA}_{\omega_1}(\mathrm{K}_m)$ and $n$-capturing are independent if $n\leq m$ and incompatible if $n>m$.
Submitted 19 January, 2018; v1 submitted 29 November, 2017;
originally announced November 2017.
-
Measurability Aspects of the Compactness Theorem for Sample Compression Schemes
Authors:
Damjan Kalajdzievski
Abstract:
It was proved in 1998 by Ben-David and Litman that a concept space has a sample compression scheme of size d if and only if every finite subspace has a sample compression scheme of size d. In the compactness theorem, measurability of the hypotheses of the created sample compression scheme is not guaranteed; at the same time, measurability of the hypotheses is a necessary condition for learnability. In this thesis we discuss when a sample compression scheme, created from compression schemes on finite subspaces via the compactness theorem, has measurable hypotheses. We show that if X is a standard Borel space with a d-maximum and universally separable concept class C, then (X,C) has a sample compression scheme of size d with universally Borel measurable hypotheses. Additionally, we introduce a new variant of compression scheme called a copy sample compression scheme.
Submitted 17 July, 2012; v1 submitted 25 May, 2012;
originally announced May 2012.