-
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Authors:
Klaudia Bałazy,
Mohammadreza Banaei,
Karl Aberer,
Jacek Tabor
Abstract:
The recent trend in scaling language models has led to a growing demand for parameter-efficient tuning (PEFT) methods such as LoRA (Low-Rank Adaptation). LoRA consistently matches or surpasses the full fine-tuning baseline with fewer parameters. However, handling numerous task-specific or user-specific LoRA modules on top of a base model still presents significant storage challenges. To address this, we introduce LoRA-XS (Low-Rank Adaptation with eXtremely Small number of parameters), a novel approach leveraging Singular Value Decomposition (SVD) for parameter-efficient fine-tuning. LoRA-XS introduces a small r x r weight matrix between frozen LoRA matrices, which are constructed by SVD of the original weight matrix. Training only r x r weight matrices ensures independence from model dimensions, enabling more parameter-efficient fine-tuning, especially for larger models. LoRA-XS achieves a remarkable reduction of trainable parameters by over 100x in 7B models compared to LoRA. Our benchmarking across various scales, including GLUE, GSM8k, and MATH benchmarks, shows that our approach outperforms LoRA and recent state-of-the-art approaches like VeRA in terms of parameter efficiency while maintaining competitive performance.
Submitted 27 May, 2024;
originally announced May 2024.
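
The construction described in the abstract above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy dimensions, the choice to fold the singular values into `A`, and the zero initialisation of `R` are all assumptions made here for clarity.

```python
import numpy as np

def lora_xs_init(W, r):
    """Build frozen factors A, B from the truncated SVD of W, plus a trainable r x r matrix R."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d_out, r), frozen; singular values folded in (assumption)
    B = Vt[:r, :]          # (r, d_in), frozen
    R = np.zeros((r, r))   # the only trainable part; zero init is an assumption
    return A, R, B

def adapted_forward(x, W, A, R, B):
    """Apply the frozen weight plus the low-rank update A @ R @ B."""
    return x @ (W + A @ R @ B).T

def trainable_params(d_out, d_in, r):
    """Per-matrix trainable parameter counts: LoRA trains two factors, LoRA-XS only R."""
    return {"lora": r * (d_in + d_out), "lora_xs": r * r}

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # toy stand-in for a pretrained weight matrix
A, R, B = lora_xs_init(W, r=4)
x = rng.standard_normal((2, 48))
y = adapted_forward(x, W, A, R, B)
print(y.shape, trainable_params(64, 48, 4))
```

Because only `R` is trained, the trainable count depends on `r` alone and not on `d_in` or `d_out`, which is the independence from model dimensions that the abstract refers to.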
-
A Lightweight Energy Management Method for Hybrid PV/Battery/Load Systems
Authors:
Mohsen Banaei,
Razgar Ebrahimy,
Henrik Madsen
Abstract:
In this paper, a computationally lightweight algorithm is introduced for hybrid PV/Battery/Load systems that is price responsive, responds fast, does not require powerful hardware, and considers the operational limitations of the system. The method is applied to two buildings equipped with PV and battery. Simulation results show that the method's electricity cost is up to 3.9% higher than that of the model predictive control (MPC) approach, while its runtime is up to 1000 times shorter. Also, while the runtime of the proposed method is comparable to that of the self-consumption maximization (SCM) approach, the fastest method, its electricity cost is about 3.2% lower than that of SCM. Simulation results also show that when the battery provides grid services, the difference in electricity cost between the proposed approach and MPC can shrink, which makes the method well suited to such applications.
Submitted 11 January, 2024;
originally announced January 2024.
-
Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention
Authors:
Negar Foroutan,
Mohammadreza Banaei,
Karl Aberer,
Antoine Bosselut
Abstract:
In this work, we study whether multilingual language models (MultiLMs) can transfer logical reasoning abilities to other languages when they are fine-tuned for reasoning in a different language. We evaluate the cross-lingual reasoning abilities of MultiLMs in two schemes: (1) where the language of the context and the question remain the same in the new languages that are tested (i.e., the reasoning is still monolingual, but the model must transfer the learned reasoning ability across languages), and (2) where the languages of the context and the question differ (which we term code-switched reasoning). On two logical reasoning datasets, RuleTaker and LeapOfThought, we demonstrate that although MultiLMs can transfer reasoning ability across languages in a monolingual setting, they struggle to transfer reasoning abilities in a code-switched setting. Following this observation, we propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences, which improves the reasoning performance by up to 14% and 4% on the RuleTaker and LeapOfThought datasets, respectively.
Submitted 23 October, 2023;
originally announced October 2023.
-
Nash Equilibrium of Joint Day-ahead Electricity Markets and Forward Contracts in Congested Power Systems
Authors:
Mohsen Banaei,
Majid Oloomi Buygi,
Hani Raouf-Sheybani,
Razgar Ebrahimy,
Henrik Madsen
Abstract:
Uncertainty in the output power of large-scale wind power plants (WPPs) can expose electricity market players to undesirable profit variations. Market players can hedge against these risks by participating in forward contract markets alongside the day-ahead markets. The participation of market players in these two markets affects their profits and also the prices and power quantities of each market. Moreover, limitations in the transmission grid can affect the optimal behavior of market players. In this paper, a Cournot Nash equilibrium model is proposed to study the behavior of market players in the forward contract market and the day-ahead electricity market in a congested power system with large-scale integration of WPPs. The proposed method is applied to a test system, and the results are discussed.
Submitted 12 October, 2023;
originally announced October 2023.
-
Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models
Authors:
Mohammadreza Banaei,
Klaudia Bałazy,
Artur Kasymov,
Rémi Lebret,
Jacek Tabor,
Karl Aberer
Abstract:
Recent transformer language models achieve outstanding results in many natural language processing (NLP) tasks. However, their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks. In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model. We challenge the classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers by compressing certain modules together positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.
Submitted 8 February, 2023;
originally announced February 2023.
-
Discovering Language-neutral Sub-networks in Multilingual Language Models
Authors:
Negar Foroutan,
Mohammadreza Banaei,
Remi Lebret,
Antoine Bosselut,
Karl Aberer
Abstract:
Multilingual pre-trained language models transfer remarkably well on cross-lingual downstream tasks. However, the extent to which they learn language-neutral representations (i.e., shared representations that encode similar phenomena across languages), and the effect of such representations on cross-lingual transfer performance, remain open questions. In this work, we conceptualize language neutrality of multilingual models as a function of the overlap between language-encoding sub-networks of these models. We employ the lottery ticket hypothesis to discover sub-networks that are individually optimized for various languages and tasks. Our evaluation across three distinct tasks and eleven typologically-diverse languages demonstrates that sub-networks for different languages are topologically similar (i.e., language-neutral), making them effective initializations for cross-lingual transfer with limited performance degradation.
Submitted 30 October, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
AdaGrid: Adaptive Grid Search for Link Prediction Training Objective
Authors:
Tim Poštuvan,
Jiaxuan You,
Mohammadreza Banaei,
Rémi Lebret,
Jure Leskovec
Abstract:
One of the most important factors that contribute to the success of a machine learning model is a good training objective. The training objective crucially influences the model's performance and generalization capabilities. This paper specifically focuses on the graph neural network training objective for link prediction, which has not been explored in the existing literature. Here, the training objective includes, among others, a negative sampling strategy and various hyperparameters, such as the edge message ratio, which controls how training edges are used. Commonly, these hyperparameters are fine-tuned by complete grid search, which is very time-consuming and model-dependent. To mitigate these limitations, we propose Adaptive Grid Search (AdaGrid), which dynamically adjusts the edge message ratio during training. It is model agnostic and highly scalable with a fully customizable computational budget. Through extensive experiments, we show that AdaGrid can boost the performance of the models by up to 1.9% while being nine times more time-efficient than a complete search. Overall, AdaGrid represents an effective automated algorithm for designing machine learning training objectives.
Submitted 8 May, 2022; v1 submitted 30 March, 2022;
originally announced March 2022.
-
Direction is what you need: Improving Word Embedding Compression in Large Language Models
Authors:
Klaudia Bałazy,
Mohammadreza Banaei,
Rémi Lebret,
Jacek Tabor,
Karl Aberer
Abstract:
The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints in edge devices, there has been a rising interest in the compression of these models to improve their inference time and memory footprint. This paper presents a novel loss objective to compress token embeddings in Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of compressed embeddings with respect to original uncompressed embeddings. The proposed method is task-agnostic and does not require further language modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model perplexity. Moreover, we evaluate our proposed approach on the SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. Our code is public.
Submitted 3 August, 2021; v1 submitted 15 June, 2021;
originally announced June 2021.
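
For context, the SVD-based matrix-factorization baseline that the abstract above compares against can be sketched roughly as below. This illustrates only the baseline, not the paper's AutoEncoder method or its loss. The function names, the toy embedding matrix, and the use of mean cosine similarity as a stand-in for "direction" are assumptions made here.

```python
import numpy as np

def svd_compress(E, k):
    """Factor an embedding matrix E (vocab x dim) into P (vocab x k) and Q (k x dim)."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k, :]

def mean_cosine(E, E_hat):
    """Average cosine similarity between each original row and its reconstruction."""
    dots = np.sum(E * E_hat, axis=1)
    norms = np.linalg.norm(E, axis=1) * np.linalg.norm(E_hat, axis=1)
    return float(np.mean(dots / norms))

rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 64))   # toy embedding table, not real model weights
P, Q = svd_compress(E, k=16)
E_hat = P @ Q                         # reconstructed embeddings

print("stored values:", P.size + Q.size, "vs original:", E.size)
print("mean cosine to originals:", mean_cosine(E, E_hat))
```

Storing `P` and `Q` instead of `E` reduces the parameter count whenever `k` is well below the embedding dimension, at the cost of a reconstruction whose rows no longer point exactly in the original directions.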
-
Spoken dialect identification in Twitter using a multi-filter architecture
Authors:
Mohammadreza Banaei,
Rémi Lebret,
Karl Aberer
Abstract:
This paper presents our approach for SwissText & KONVENS 2020 shared task 2, which is a multi-stage neural model for Swiss German (GSW) identification on Twitter. Our model outputs either GSW or non-GSW and is not meant to be used as a generic language identifier. Our architecture consists of two independent filters, where the first one favors recall and the second one favors precision (both towards GSW). Moreover, we do not use binary models (GSW vs. not-GSW) in our filters but rather a multi-class classifier with GSW being one of the possible labels. Our model reaches an F1-score of 0.982 on the test set of the shared task.
Submitted 5 June, 2020;
originally announced June 2020.
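
The two-filter design described in the abstract above can be sketched as a simple cascade. This is a toy illustration under stated assumptions: the keyword-based `loose_filter` and `strict_filter` are hypothetical stand-ins for the paper's neural multi-class classifiers, and the rule that a tweet is labeled GSW only when both filters predict GSW is an assumption about how the stages combine.

```python
def cascade(tweets, recall_filter, precision_filter, target="GSW"):
    """Label a tweet as the target only if both multi-class filters predict the target."""
    labels = []
    for text in tweets:
        keep = recall_filter(text) == target and precision_filter(text) == target
        labels.append(target if keep else "non-" + target)
    return labels

def loose_filter(text):
    """Hypothetical recall-oriented filter: accepts on any weak cue."""
    words = set(text.lower().split())
    return "GSW" if words & {"isch", "gsi", "und"} else "other"

def strict_filter(text):
    """Hypothetical precision-oriented filter: multi-class, needs a strong cue for GSW."""
    words = set(text.lower().split())
    if words & {"isch", "gsi"}:
        return "GSW"
    if "und" in words:
        return "DE"
    return "other"

tweets = ["das isch guet", "das ist gut und klar", "hello world"]
print(cascade(tweets, loose_filter, strict_filter))
```

Each filter returns one of several language labels rather than a yes/no decision, and the cascade collapses those labels into the binary GSW or non-GSW output that the abstract describes.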