Search | arXiv e-print repository

Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Authors: Kunjal Panchal, Nisarg Parikh, Sunav Choudhary, Lijun Zhang, Yuriy Brun, Hui Guan

Abstract: Finetuning large language models (LLMs) in federated learning (FL) settings has become important as it allows resource-constrained devices to finetune a model using private data. However, finetuning LLMs using backpropagation requires excessive memory (especially from intermediate activations) for resource-constrained devices. While Forward-mode Auto-Differentiation (AD) can reduce memory footprin… ▽ More Finetuning large language models (LLMs) in federated learning (FL) settings has become important as it allows resource-constrained devices to finetune a model using private data. However, finetuning LLMs using backpropagation requires excessive memory (especially from intermediate activations) for resource-constrained devices. While Forward-mode Auto-Differentiation (AD) can reduce memory footprint from activations, we observe that directly applying it to LLM finetuning results in slow convergence and poor accuracy. This work introduces Spry, an FL algorithm that splits trainable weights of an LLM among participating clients, such that each client computes gradients using Forward-mode AD that are closer estimates of the true gradients. Spry achieves a low memory footprint, high accuracy, and fast convergence. We theoretically show that the global gradients in Spry are unbiased estimates of true global gradients for homogeneous data distributions across clients, while heterogeneity increases bias of the estimates. We also derive Spry's convergence rate, showing that the gradients decrease inversely proportional to the number of FL rounds, indicating the convergence up to the limits of heterogeneity. Empirically, Spry reduces the memory footprint during training by 1.4-7.1$\times$ in contrast to backpropagation, while reaching comparable accuracy, across a wide range of language tasks, models, and FL settings. Spry reduces the convergence time by 1.2-20.3$\times$ and achieves 5.2-13.5\% higher accuracy against state-of-the-art zero-order methods. When finetuning Llama2-7B with LoRA, compared to the peak memory usage of 33.9GB of backpropagation, Spry only consumes 6.2GB of peak memory. For OPT13B, the reduction is from 76.5GB to 10.8GB. Spry makes feasible previously impossible FL deployments on commodity mobile and edge devices. Source code is available at https://github.com/Astuary/Spry. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2211.15281 [pdf, other]

Flow: Per-Instance Personalized Federated Learning Through Dynamic Routing

Authors: Kunjal Panchal, Sunav Choudhary, Nisarg Parikh, Lijun Zhang, Hui Guan

Abstract: Personalization in Federated Learning (FL) aims to modify a collaboratively trained global model according to each client. Current approaches to personalization in FL are at a coarse granularity, i.e. all the input instances of a client use the same personalized model. This ignores the fact that some instances are more accurately handled by the global model due to better generalizability. To addre… ▽ More Personalization in Federated Learning (FL) aims to modify a collaboratively trained global model according to each client. Current approaches to personalization in FL are at a coarse granularity, i.e. all the input instances of a client use the same personalized model. This ignores the fact that some instances are more accurately handled by the global model due to better generalizability. To address this challenge, this work proposes Flow, a fine-grained stateless personalized FL approach. Flow creates dynamic personalized models by learning a routing mechanism that determines whether an input instance prefers the local parameters or its global counterpart. Thus, Flow introduces per-instance routing in addition to leveraging per-client personalization to improve accuracies at each client. Further, Flow is stateless which makes it unnecessary for a client to retain its personalized state across FL rounds. This makes Flow practical for large-scale FL settings and friendly to newly joined clients. Evaluations on Stackoverflow, Reddit, and EMNIST datasets demonstrate the superiority in prediction accuracy of Flow over state-of-the-art non-personalized and only per-client personalized approaches to FL. △ Less

Submitted 10 February, 2024; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: 37th Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

arXiv:2208.06882 [pdf, other]

CoShNet: A Hybrid Complex Valued Neural Network using Shearlets

Authors: Manny Ko, Ujjawal K. Panchal, Héctor Andrade-Loarca, Andres Mendez-Vazquez

Abstract: In a hybrid neural network, the expensive convolutional layers are replaced by a non-trainable fixed transform with a great reduction in parameters. In previous works, good results were obtained by replacing the convolutions with wavelets. However, wavelet based hybrid network inherited wavelet's lack of vanishing moments along curves and its axis-bias. We propose to use Shearlets with its robust… ▽ More In a hybrid neural network, the expensive convolutional layers are replaced by a non-trainable fixed transform with a great reduction in parameters. In previous works, good results were obtained by replacing the convolutions with wavelets. However, wavelet based hybrid network inherited wavelet's lack of vanishing moments along curves and its axis-bias. We propose to use Shearlets with its robust support for important image features like edges, ridges and blobs. The resulting network is called Complex Shearlets Network (CoShNet). It was tested on Fashion-MNIST against ResNet-50 and Resnet-18, obtaining 92.2% versus 90.7% and 91.8% respectively. The proposed network has 49.9k parameters versus ResNet-18 with 11.18m and use 52 times fewer FLOPs. Finally, we trained in under 20 epochs versus 200 epochs required by ResNet and do not need any hyperparameter tuning nor regularization. Code: https://github.com/Ujjawal-K-Panchal/coshnet △ Less

Submitted 29 October, 2022; v1 submitted 14 August, 2022; originally announced August 2022.

Comments: 16 pages, 11 figures

arXiv:2011.10916

Hierachical Delta-Attention Method for Multimodal Fusion

Authors: Kunjal Panchal

Abstract: In vision and linguistics; the main input modalities are facial expressions, speech patterns, and the words uttered. The issue with analysis of any one mode of expression (Visual, Verbal or Vocal) is that lot of contextual information can get lost. This asks researchers to inspect multiple modalities to get a thorough understanding of the cross-modal dependencies and temporal context of the situat… ▽ More In vision and linguistics; the main input modalities are facial expressions, speech patterns, and the words uttered. The issue with analysis of any one mode of expression (Visual, Verbal or Vocal) is that lot of contextual information can get lost. This asks researchers to inspect multiple modalities to get a thorough understanding of the cross-modal dependencies and temporal context of the situation to analyze the expression. This work attempts at preserving the long-range dependencies within and across different modalities, which would be bottle-necked by the use of recurrent networks and adds the concept of delta-attention to focus on local differences per modality to capture the idiosyncrasy of different people. We explore a cross-attention fusion technique to get the global view of the emotion expressed through these delta-self-attended modalities, in order to fuse all the local nuances and global context together. The addition of attention is new to the multi-modal fusion field and currently being scrutinized for on what stage the attention mechanism should be used, this work achieves competitive accuracy for overall and per-class classification which is close to the current state-of-the-art with almost half number of parameters. △ Less

Submitted 28 November, 2022; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: Need to update the results

arXiv:2011.02579 [pdf, other]

Improved Algorithm for Seamlessly Creating Infinite Loops from a Video Clip, while Preserving Variety in Textures

Authors: Kunjal Panchal

Abstract: This project implements the paper "Video Textures" by Szeliski. The aim is to create a "Moving Picture" or as we popularly call it, a GIF; which is "somewhere between a photograph and a video". The idea is to input a video which has some repeated motion (the texture), such as a flag waving, rain, or a candle flame. The output is a new video that infinitely extends the original video in a seamless… ▽ More This project implements the paper "Video Textures" by Szeliski. The aim is to create a "Moving Picture" or as we popularly call it, a GIF; which is "somewhere between a photograph and a video". The idea is to input a video which has some repeated motion (the texture), such as a flag waving, rain, or a candle flame. The output is a new video that infinitely extends the original video in a seamless way. In practice, the output isn't really infinte, but is instead looped using a video player and is sufficiently long as to appear to never repeat. Our goal from this implementation was to: improve distance metric by switching from a crude sum of squared distance to most sophisticated wavelet-based distance; add intensity normalization, cross-fading and morphing to the suggested basic algorithm. We also experiment on the trade-off between variety and smoothness. △ Less

Submitted 4 November, 2020; originally announced November 2020.

arXiv:2010.15985 [pdf, other]

Differential Privacy and Natural Language Processing to Generate Contextually Similar Decoy Messages in Honey Encryption Scheme

Authors: Kunjal Panchal

Abstract: Honey Encryption is an approach to encrypt the messages using low min-entropy keys, such as weak passwords, OTPs, PINs, credit card numbers. The ciphertext is produces, when decrypted with any number of incorrect keys, produces plausible-looking but bogus plaintext called "honey messages". But the current techniques used in producing the decoy plaintexts do not model human language entirely. A gib… ▽ More Honey Encryption is an approach to encrypt the messages using low min-entropy keys, such as weak passwords, OTPs, PINs, credit card numbers. The ciphertext is produces, when decrypted with any number of incorrect keys, produces plausible-looking but bogus plaintext called "honey messages". But the current techniques used in producing the decoy plaintexts do not model human language entirely. A gibberish, random assortment of words is not enough to fool an attacker; that will not be acceptable and convincing, whether or not the attacker knows some information of the genuine source. In this paper, I focus on the plaintexts which are some non-numeric informative messages. In order to fool the attacker into believing that the decoy message can actually be from a certain source, we need to capture the empirical and contextual properties of the language. That is, there should be no linguistic difference between real and fake message, without revealing the structure of the real message. I employ natural language processing and generalized differential privacy to solve this problem. Mainly I focus on machine learning methods like keyword extraction, context classification, bags-of-words, word embeddings, transformers for text processing to model privacy for text documents. Then I prove the security of this approach with e-differential privacy. △ Less

Submitted 29 October, 2020; originally announced October 2020.

arXiv:0912.1009 [pdf]

Biogeography based Satellite Image Classification

Authors: V. K. Panchal, Parminder Singh, Navdeep Kaur, Harish Kundra

Abstract: Biogeography is the study of the geographical distribution of biological organisms. The mindset of the engineer is that we can learn from nature. Biogeography Based Optimization is a burgeoning nature inspired technique to find the optimal solution of the problem. Satellite image classification is an important task because it is the only way we can know about the land cover map of inaccessible a… ▽ More Biogeography is the study of the geographical distribution of biological organisms. The mindset of the engineer is that we can learn from nature. Biogeography Based Optimization is a burgeoning nature inspired technique to find the optimal solution of the problem. Satellite image classification is an important task because it is the only way we can know about the land cover map of inaccessible areas. Though satellite images have been classified in past by using various techniques, the researchers are always finding alternative strategies for satellite image classification so that they may be prepared to select the most appropriate technique for the feature extraction task in hand. This paper is focused on classification of the satellite image of a particular land cover using the theory of Biogeography based Optimization. The original BBO algorithm does not have the inbuilt property of clustering which is required during image classification. Hence modifications have been proposed to the original algorithm and the modified algorithm is used to classify the satellite image of a given region. The results indicate that highly accurate land cover features can be extracted effectively when the proposed algorithm is used. △ Less

Submitted 5 December, 2009; originally announced December 2009.

Comments: 6 pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS November 2009, ISSN 1947 5500, http://sites.google.com/site/ijcsis/

Report number: ISSN 1947 5500

Journal ref: International Journal of Computer Science and Information Security, IJCSIS, Vol. 6, No. 2, pp. 269-274, November 2009, USA

Showing 1–7 of 7 results for author: Panchal, K