
ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Authors: Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon

Abstract: Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data parallel settings induces a communication overhead increasing with the number of distributed workers, which can impede the efficiency gains of parallelization. To address this challenge, optimization algorithms reducing inter-worker communication have emerged, such as local optimization methods used in Federated Learning. While effective in minimizing communication overhead, these methods incur significant memory costs, hindering scalability: in addition to extra momentum variables, if communications are only allowed between multiple local optimization steps, then the optimizer's states cannot be sharded among workers. In response, we propose $\textbf{AC}$cumulate while $\textbf{CO}$mmunicate ($\texttt{ACCO}$), a memory-efficient optimization algorithm tailored for distributed training of LLMs. $\texttt{ACCO}$ allows optimizer states to be sharded across workers, overlaps gradient computations and communications to conceal communication costs, and accommodates heterogeneous hardware. Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications, eliminating the need for warmup steps and aligning with the training dynamics of standard distributed optimization while converging faster in terms of wall-clock time. We demonstrate the effectiveness of $\texttt{ACCO}$ on several LLM training and fine-tuning tasks.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.17517 [pdf, other]

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Authors: Louis Fournier, Adel Nabli, Masih Aminbeidokhti, Marco Pedersoli, Eugene Belilovsky, Edouard Oyallon

Abstract: The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2306.08289 [pdf, other]

$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning

Authors: Adel Nabli, Eugene Belilovsky, Edouard Oyallon

Abstract: Distributed training of Deep Learning models has been critical to many recent successes in the field. Current standard methods primarily rely on synchronous centralized algorithms which induce major communication bottlenecks and synchronization locks at scale. Decentralized asynchronous algorithms are emerging as a potential alternative but their practical applicability still lags. In order to mitigate the increase in communication cost that naturally comes with scaling the number of workers, we introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named $\textbf{A}^2\textbf{CiD}^2$. Our method allows each worker to continuously process mini-batches without stopping, and run a peer-to-peer averaging routine in parallel, reducing idle time. In addition to inducing a significant communication acceleration at no cost other than adding a local momentum variable, minimal adaptation is required to incorporate $\textbf{A}^2\textbf{CiD}^2$ into standard asynchronous approaches. Our theoretical analysis proves accelerated rates compared to previous asynchronous decentralized baselines and we empirically show that using our $\textbf{A}^2\textbf{CiD}^2$ momentum significantly decreases communication costs in poorly connected networks. In particular, we show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies.

Submitted 6 December, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Journal ref: Thirty-seventh Conference on Neural Information Processing Systems, Dec 2023, New Orleans, United States

arXiv:2208.00779 [pdf, ps, other]

DADAO: Decoupled Accelerated Decentralized Asynchronous Optimization

Authors: Adel Nabli, Edouard Oyallon

Abstract: This work introduces DADAO: the first decentralized, accelerated, asynchronous, primal, first-order algorithm to minimize a sum of $L$-smooth and $μ$-strongly convex functions distributed over a given network of size $n$. Our key insight is based on modeling the local gradient updates and gossip communication procedures with separate independent Poisson Point Processes. This allows us to decouple the computation and communication steps, which can be run in parallel, while making the whole approach completely asynchronous. This leads to communication acceleration compared to synchronous approaches. Our new method employs primal gradients and does not use a multi-consensus inner loop nor other ad-hoc mechanisms such as Error Feedback, Gradient Tracking, or a Proximal operator. By relating the inverse of the smallest positive eigenvalue of the Laplacian matrix $χ_1$ and the maximal resistance $χ_2\leq χ_1$ of the graph to a sufficient minimal communication rate between the nodes of the network, we show that our algorithm requires $\mathcal{O}(n\sqrt{\frac{L}μ}\log(\frac{1}ε))$ local gradients and only $\mathcal{O}(n\sqrt{χ_1χ_2}\sqrt{\frac{L}μ}\log(\frac{1}ε))$ communications to reach a precision $ε$, up to logarithmic terms. Thus, we simultaneously obtain an accelerated rate for both computations and communications, leading to an improvement over state-of-the-art works, our simulations further validating the strength of our relatively unconstrained method.

Submitted 6 December, 2023; v1 submitted 26 July, 2022; originally announced August 2022.

Comments: International Conference on Machine Learning, Jul 2023, Honolulu, United States

arXiv:2204.05148 [pdf, other]

Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

Authors: Robin Algayres, Adel Nabli, Benoit Sagot, Emmanuel Dupoux

Abstract: We introduce a simple neural encoder architecture that can be trained using an unsupervised contrastive learning objective which gets its positive samples from data-augmented k-Nearest Neighbors search. We show that when built on top of recent self-supervised audio representations, this method can be applied iteratively and yield competitive speech sequence embeddings (SSE) as evaluated on two tasks: query-by-example of random sequences of speech, and spoken term discovery. On both tasks our method pushes the state-of-the-art by a significant margin across 5 different languages. Finally, we establish a benchmark on a query-by-example task on the LibriSpeech dataset to monitor future improvements in the field.

Submitted 21 October, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Interspeech 2022. New version on 10/21/23 with appendix data and GitLab link

arXiv:2007.03151 [pdf, other]

Curriculum learning for multilevel budgeted combinatorial problems

Authors: Adel Nabli, Margarida Carvalho

Abstract: Learning heuristics for combinatorial optimization problems through graph neural networks have recently shown promising results on some classic NP-hard problems. These are single-level optimization problems with only one player. Multilevel combinatorial optimization problems are their generalization, encompassing situations with multiple players taking decisions sequentially. By framing them in a multi-agent reinforcement learning setting, we devise a value-based method to learn to solve multilevel budgeted combinatorial problems involving two players in a zero-sum game over a graph. Our framework is based on a simple curriculum: if an agent knows how to estimate the value of instances with budgets up to $B$, then solving instances with budget $B+1$ can be done in polynomial time regardless of the direction of the optimization by checking the value of every possible afterstate. Thus, in a bottom-up approach, we generate datasets of heuristically solved instances with increasingly larger budgets to train our agent. We report results close to optimality on graphs up to $100$ nodes and a $185 \times$ speedup on average compared to the quickest exact solver known for the Multilevel Critical Node problem, a max-min-max trilevel problem that has been shown to be at least $Σ_2^p$-hard.

Submitted 26 October, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: NeurIPS 2020, December 2020

arXiv:2007.02370 [pdf, ps, other]

Complexity of the Multilevel Critical Node Problem

Authors: Adel Nabli, Margarida Carvalho, Pierre Hosteins

Abstract: In this work, we analyze a sequential game played in a graph called the Multilevel Critical Node problem (MCN). A defender and an attacker are the players of this game. The defender starts by preventively interdicting vertices (vaccination) from being attacked. Then, the attacker infects a subset of non-vaccinated vertices and, finally, the defender reacts with a protection strategy. We provide the first computational complexity results associated with MCN and its subgames. Moreover, by considering unitary, weighted, undirected, and directed graphs, we clarify how the theoretical tractability of those problems varies. Our findings contribute new NP-complete, $Σ_2^p$-complete and $Σ_3^p$-complete problems. Furthermore, for the last level of the game, the protection stage, we build polynomial time algorithms for certain graph classes.

Submitted 2 October, 2020; v1 submitted 5 July, 2020; originally announced July 2020.

arXiv:1203.3589 [pdf]

Building MultiView Analyst Profile From Multidimensional Query Logs: From Consensual to Conflicting Preferences

Authors: Eya Ben Ahmed, Ahlem Nabli, Faïez Gargouri

Abstract: In order to provide results suited to analysts' needs, user preference summarization is widely used in several domains. In this paper, we introduce a new approach for user profile construction from OLAP query logs. The key idea is to learn the user's preferences by drawing the evidence from OLAP logs. The analyst preferences are clustered into three main pools: (i) consensual or non-conflicting preferences, referring to the same preferences for all analysts; (ii) semi-conflicting preferences, corresponding to similar preferences for some analysts; (iii) conflicting preferences, related to disjoint preferences for all analysts. To build a generic and global model accurately describing the analyst, we enrich the obtained characteristics by including several views, namely the personal view, the professional view and the behavioral view. After that, the multiview profile extracted from a multidimensional database can be annotated.

Submitted 15 March, 2012; originally announced March 2012.

Comments: 8 pages

Journal ref: IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 2, January 2012 ISSN (Online): 1694-0814 www.IJCSI.org

arXiv:1112.5957

Usage Des Mesures Pour La Génération Des Règles d'Associations Cycliques

Authors: Eya Ben Ahmed, Ahlem Nabli, Faïez Gargouri

Abstract: Online analytical processing (OLAP) does not provide any explanation of the correlations discovered between data. Thus, the coupling of OLAP and data mining, especially association rules, is considered an efficient solution to this problem. In this context, we mainly focus on a particular class of association rules, the cyclic association rules. These rules aim to discover patterns that display regular variation over user-defined intervals. Generally, the generated patterns do not take advantage of the specificities of the multidimensional context, namely the consideration of the measures and their aggregations. In this paper, we introduce a novel method for extracting cyclic association rules from measures, and we redefine the evaluation metrics of association rule quality, inspired by the concept of temporal summarizability of measures, through the integration of appropriate aggregation functions. To prove the usefulness of our approach, we conduct an empirical study on a real data warehouse.

Submitted 9 September, 2012; v1 submitted 27 December, 2011; originally announced December 2011.

Comments: 18 pages, 3 figures; 7 ème journées Francophones sur les Entrepôts de données et l'Analyse en ligne (EDA'2011)

arXiv:1107.1779 [pdf]

A Survey of User-Centric Data Warehouses: From Personalization to Recommendation

Authors: Eya Ben Ahmed, Ahlem Nabli, Faïez Gargouri

Abstract: Providing customized support for OLAP brings tremendous challenges to OLAP technology. Standing at the crossroads of preferences and the data warehouse, two emerging trends are pointed out, namely: (i) personalization and (ii) recommendation. Despite the panoply of proposed approaches, the issues of the user-centric data warehouse community have not been addressed yet. In this paper we draw an overview of several user-centric data warehouse proposals. We also discuss the two promising concepts in this area, namely the personalization and the recommendation of data warehouses. We compare the current approaches with each other with respect to some criteria.

Submitted 9 July, 2011; originally announced July 2011.

Comments: 13 pages, 3 figures, 1 table

Journal ref: The International Journal of Database Management Systems (IJDMS), May 2011, Volume 3, Number 2

Showing 1–10 of 10 results for author: Nabli, A