-
MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
Authors:
Ionut-Vlad Modoranu,
Mher Safaryan,
Grigory Malinovsky,
Eldar Kurtic,
Thomas Robert,
Peter Richtarik,
Dan Alistarh
Abstract:
We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MICROADAM that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instanc…
▽ More
We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MICROADAM that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instance of the classical error feedback mechanism from distributed optimization [Seide et al., 2014, Alistarh et al., 2018, Karimireddy et al., 2019] in which the error correction information is itself compressed to allow for practical memory gains. We prove that the resulting approach maintains theoretical convergence guarantees competitive to those of AMSGrad, while providing good practical performance. Specifically, we show that MICROADAM can be implemented efficiently on GPUs: on both million-scale (BERT) and billion-scale (LLaMA) models, MicroAdam provides practical convergence competitive to that of the uncompressed Adam baseline, with lower memory usage and similar running time. Our code is available at https://github.com/IST-DASLab/MicroAdam.
△ Less
Submitted 24 May, 2024;
originally announced May 2024.
-
Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance Reduction
Authors:
Yury Demidovich,
Grigory Malinovsky,
Peter Richtárik
Abstract:
In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner loop length is challenging in practice, as it depends on str…
▽ More
In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner loop length is challenging in practice, as it depends on strong convexity or smoothness constants, which are often unknown or hard to estimate. Motivated by Euclidean methods, we introduce the Riemannian Loopless SVRG (R-LSVRG) and PAGE (R-PAGE) methods. These methods replace the outer loop with probabilistic gradient computation triggered by a coin flip in each iteration, ensuring simpler proofs, efficient hyperparameter selection, and sharp convergence guarantees. Using R-PAGE as a framework for non-convex Riemannian optimization, we demonstrate its applicability to various important settings. For example, we derive Riemannian MARINA (R-MARINA) for distributed settings with communication compression, providing the best theoretical communication complexity guarantees for non-convex distributed optimization over Riemannian manifolds. Experimental results support our theoretical findings.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
MAST: Model-Agnostic Sparsified Training
Authors:
Yury Demidovich,
Grigory Malinovsky,
Egor Shulgin,
Peter Richtárik
Abstract:
We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful p…
▽ More
We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful properties of the proposed objective function and highlight its connections to the standard formulation. Furthermore, we present several variants of the Stochastic Gradient Descent (SGD) method adapted to the new problem formulation, including SGD with general sampling, a distributed version, and SGD with variance reduction techniques. We achieve tighter convergence rates and relax assumptions, bridging the gap between theoretical principles and practical applications, covering several important techniques such as Dropout and Sparse training. This work presents promising opportunities to enhance the theoretical understanding of model training through a sparsification-aware optimization approach.
△ Less
Submitted 27 November, 2023;
originally announced November 2023.
-
Byzantine Robustness and Partial Participation Can Be Achieved at Once: Just Clip Gradient Differences
Authors:
Grigory Malinovsky,
Peter Richtárik,
Samuel Horváth,
Eduard Gorbunov
Abstract:
Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients…
▽ More
Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients, which is not always practical due to the unavailability of some clients or communication constraints. In our work, we propose the first distributed method with client sampling and provable tolerance to Byzantine workers. The key idea behind the developed method is the use of gradient clip** to control stochastic gradient differences in recursive variance reduction. This allows us to bound the potential harm caused by Byzantine workers, even during iterations when all sampled clients are Byzantine. Furthermore, we incorporate communication compression into the method to enhance communication efficiency. Under general assumptions, we prove convergence rates for the proposed method that match the existing state-of-the-art (SOTA) theoretical results. We also propose a heuristic on adjusting any Byzantine-robust method to a partial participation scenario via clip**.
△ Less
Submitted 7 June, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Improving Accelerated Federated Learning with Compression and Importance Sampling
Authors:
Michał Grudzień,
Grigory Malinovsky,
Peter Richtárik
Abstract:
Federated Learning is a collaborative training framework that leverages heterogeneous data distributed across a vast number of clients. Since it is practically infeasible to request and process all clients during the aggregation step, partial participation must be supported. In this setting, the communication between the server and clients poses a major bottleneck. To reduce communication loads, t…
▽ More
Federated Learning is a collaborative training framework that leverages heterogeneous data distributed across a vast number of clients. Since it is practically infeasible to request and process all clients during the aggregation step, partial participation must be supported. In this setting, the communication between the server and clients poses a major bottleneck. To reduce communication loads, there are two main approaches: compression and local steps. Recent work by Mishchenko et al. [2022] introduced the new ProxSkip method, which achieves an accelerated rate using the local steps technique. Follow-up works successfully combined local steps acceleration with partial participation [Grudzień et al., 2023, Condat et al. 2023] and gradient compression [Condat et al. [2022]. In this paper, we finally present a complete method for Federated Learning that incorporates all necessary ingredients: Local Training, Compression, and Partial Participation. We obtain state-of-the-art convergence guarantees in the considered setting. Moreover, we analyze the general sampling framework for partial participation and derive an importance sampling scheme, which leads to even better performance. We experimentally demonstrate the advantages of the proposed method in practice.
△ Less
Submitted 5 June, 2023;
originally announced June 2023.
-
A Guide Through the Zoo of Biased SGD
Authors:
Yury Demidovich,
Grigory Malinovsky,
Igor Sokolov,
Peter Richtárik
Abstract:
Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively over at least half a century, SGD variants relying on biased estimators are rare. Nevertheless, there has been an increased interest in this topic in recent years. However, existing literature on SGD with biased es…
▽ More
Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively over at least half a century, SGD variants relying on biased estimators are rare. Nevertheless, there has been an increased interest in this topic in recent years. However, existing literature on SGD with biased estimators (BiasedSGD) lacks coherence since each new paper relies on a different set of assumptions, without any clear understanding of how they are connected, which may lead to confusion. We address this gap by establishing connections among the existing assumptions, and presenting a comprehensive map of the underlying relationships. Additionally, we introduce a new set of assumptions that is provably weaker than all previous assumptions, and use it to present a thorough analysis of BiasedSGD in both convex and non-convex settings, offering advantages over previous results. We also provide examples where biased estimators outperform their unbiased counterparts or where unbiased versions are simply not available. Finally, we demonstrate the effectiveness of our framework through experimental results that validate our theoretical findings.
△ Less
Submitted 25 May, 2023;
originally announced May 2023.
-
TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation
Authors:
Laurent Condat,
Ivan Agarský,
Grigory Malinovsky,
Peter Richtárik
Abstract:
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server. Communication is usually slow and costly and forms the main bottleneck. This is particularly true in federated learning, where a large number of users collaborate toward a global training task. In addition, it is desirable for a robust algorithm to al…
▽ More
In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server. Communication is usually slow and costly and forms the main bottleneck. This is particularly true in federated learning, where a large number of users collaborate toward a global training task. In addition, it is desirable for a robust algorithm to allow for partial participation, since it is often the case that some clients are not able to participate to the entire process and are idle at certain times. Two strategies are popular to reduce the communication burden: 1) local training, which consists in communicating less frequently, or equivalently performing more local computations between the communication rounds; and 2) compression, whereby compressed information instead of full-dimensional vectors is communicated. We propose TAMUNA, the first algorithm for distributed optimization that leveraged the two strategies of local training and compression jointly and allows for partial participation. In the strongly convex setting, TAMUNA converges linearly to the exact solution and provably benefits from the two mechanisms: it exhibits a doubly-accelerated convergence rate, with respect to the condition number of the functions and the model dimension.
△ Less
Submitted 27 April, 2024; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Federated Learning with Regularized Client Participation
Authors:
Grigory Malinovsky,
Samuel Horváth,
Konstantin Burlachenko,
Peter Richtárik
Abstract:
Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is randomly selecting a subset of clients at each communicatio…
▽ More
Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is randomly selecting a subset of clients at each communication round. In our research, we propose a new technique and design a novel regularized client participation scheme. Under this scheme, each client joins the learning process every $R$ communication rounds, which we refer to as a meta epoch. We have found that this participation scheme leads to a reduction in the variance caused by client sampling. Combined with the popular FedAvg algorithm (McMahan et al., 2017), it results in superior rates under standard assumptions. For instance, the optimization term in our main convergence bound decreases linearly with the product of the number of communication rounds and the size of the local dataset of each client, and the statistical term scales with step size quadratically instead of linearly (the case for client sampling with replacement), leading to better convergence rate $\mathcal{O}\left(\frac{1}{T^2}\right)$ compared to $\mathcal{O}\left(\frac{1}{T}\right)$, where $T$ is the total number of communication rounds. Furthermore, our results permit arbitrary client availability as long as each client is available for training once per each meta epoch.
△ Less
Submitted 28 February, 2023; v1 submitted 7 February, 2023;
originally announced February 2023.
-
An Optimal Algorithm for Strongly Convex Min-min Optimization
Authors:
Alexander Gasnikov,
Dmitry Kovalev,
Grigory Malinovsky
Abstract:
In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{κ_x,κ_y\}} \log 1/ε)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $κ_x$ and $κ_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires…
▽ More
In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{κ_x,κ_y\}} \log 1/ε)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $κ_x$ and $κ_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires $\mathcal{O}(\sqrt{κ_x} \log 1/ε)$ of computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{κ_y} \log 1/ε)$ computations of $\nabla_y f(x,y)$. In some applications $κ_x \gg κ_y$, and computation of $\nabla_y f(x,y)$ is significantly cheaper than computation of $\nabla_x f(x,y)$. In this case, our algorithm substantially outperforms the existing state-of-the-art methods.
△ Less
Submitted 8 February, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Can 5th Generation Local Training Methods Support Client Sampling? Yes!
Authors:
Michał Grudzień,
Grigory Malinovsky,
Peter Richtárik
Abstract:
The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022…
▽ More
The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022) identified four distinct generations of LT methods based on the quality of the provided theoretical communication complexity guarantees. Despite a lot of progress in this area, none of the existing works were able to show that it is theoretically better to employ multiple local gradient-type steps (i.e., to engage in LT) than to rely on a single local gradient-type step only in the important heterogeneous data regime. In a recent breakthrough embodied in their ProxSkip method and its theoretical analysis, Mishchenko et al. (2022) showed that LT indeed leads to provable communication acceleration for arbitrarily heterogeneous data, thus jump-starting the $5^{\rm th}$ generation of LT methods. However, while these latest generation LT methods are compatible with DS, none of them support CS. We resolve this open problem in the affirmative. In order to do so, we had to base our algorithmic development on new algorithmic and theoretical foundations.
△ Less
Submitted 6 January, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization
Authors:
Soumia Boucherouite,
Grigory Malinovsky,
Peter Richtárik,
EL Houcine Bergou
Abstract:
In this paper, we propose a new zero order optimization method called minibatch stochastic three points (MiSTP) method to solve an unconstrained minimization problem in a setting where only an approximation of the objective function evaluation is possible. It is based on the recently proposed stochastic three points (STP) method (Bergou et al., 2020). At each iteration, MiSTP generates a random se…
▽ More
In this paper, we propose a new zero order optimization method called minibatch stochastic three points (MiSTP) method to solve an unconstrained minimization problem in a setting where only an approximation of the objective function evaluation is possible. It is based on the recently proposed stochastic three points (STP) method (Bergou et al., 2020). At each iteration, MiSTP generates a random search direction in a similar manner to STP, but chooses the next iterate based solely on the approximation of the objective function rather than its exact evaluations. We also analyze our method's complexity in the nonconvex and convex cases and evaluate its performance on multiple machine learning tasks.
△ Less
Submitted 16 September, 2022;
originally announced September 2022.
-
Variance Reduced ProxSkip: Algorithm, Theory and Application to Federated Learning
Authors:
Grigory Malinovsky,
Kai Yi,
Peter Richtárik
Abstract:
We study distributed optimization methods based on the {\em local training (LT)} paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we {\em identify 5 generations of LT methods}: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5…
▽ More
We study distributed optimization methods based on the {\em local training (LT)} paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we {\em identify 5 generations of LT methods}: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5${}^{\rm th}$ generation, initiated by the ProxSkip method of Mishchenko, Malinovsky, Stich and Richtárik (2022) and its analysis, is characterized by the first theoretical confirmation that LT is a communication acceleration mechanism. Inspired by this recent progress, we contribute to the 5${}^{\rm th}$ generation of LT methods by showing that it is possible to enhance them further using {\em variance reduction}. While all previous theoretical results for LT methods ignore the cost of local work altogether, and are framed purely in terms of the number of communication rounds, we show that our methods can be substantially faster in terms of the {\em total training cost} than the state-of-the-art method ProxSkip in theory and practice in the regime when local computation is sufficiently expensive. We characterize this threshold theoretically, and confirm our theoretical predictions with empirical results.
△ Less
Submitted 9 July, 2022;
originally announced July 2022.
-
Federated Optimization Algorithms with Random Reshuffling and Gradient Compression
Authors:
Abdurakhmon Sadiev,
Grigory Malinovsky,
Eduard Gorbunov,
Igor Sokolov,
Ahmed Khaled,
Konstantin Burlachenko,
Peter Richtárik
Abstract:
Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling,…
▽ More
Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling, e.g., Random Reshuffling (RR) method, perform better than ones that sample the gradients with-replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a naïve combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, but the theoretical analysis of Q-RR does not show any benefits of using RR. Our extensive numerical experiments confirm this phenomenon. This happens due to the additional compression variance. To reveal the true advantages of RR in the distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts with with-replacement sampling of stochastic gradients. Next, to have a better fit to Federated Learning applications, we incorporate local computation, i.e., we propose and analyze the variants of Q-RR and DIANA-RR -- Q-NASTYA and DIANA-NASTYA that use local gradient steps and different local and global stepsizes. Finally, we conducted several numerical experiments to illustrate our theoretical results.
△ Less
Submitted 3 November, 2022; v1 submitted 14 June, 2022;
originally announced June 2022.
-
Federated Random Reshuffling with Compression and Variance Reduction
Authors:
Grigory Malinovsky,
Peter Richtárik
Abstract:
Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent (SGD) employing sampling without replacement, is an immensely popular method for training supervised machine learning models via empirical risk minimization. Due to its superior practical performance, it is embedded and often set as default in standard machine learning software. Under the name FedRR, this method was recentl…
▽ More
Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent (SGD) employing sampling without replacement, is an immensely popular method for training supervised machine learning models via empirical risk minimization. Due to its superior practical performance, it is embedded and often set as default in standard machine learning software. Under the name FedRR, this method was recently shown to be applicable to federated learning (Mishchenko et al.,2021), with superior performance when compared to common baselines such as Local SGD. Inspired by this development, we design three new algorithms to improve FedRR further: compressed FedRR and two variance reduced extensions: one for taming the variance coming from shuffling and the other for taming the variance due to compression. The variance reduction mechanism for compression allows us to eliminate dependence on the compression parameter, and applying additional controlled linear perturbations for Random Reshuffling, introduced by Malinovsky et al.(2021) helps to eliminate variance at the optimum. We provide the first analysis of compressed local methods under standard assumptions without bounded gradient assumptions and for heterogeneous data, overcoming the limitations of the compression operator. We corroborate our theoretical results with experiments on synthetic and real data sets.
△ Less
Submitted 10 May, 2022; v1 submitted 8 May, 2022;
originally announced May 2022.
-
ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!
Authors:
Konstantin Mishchenko,
Grigory Malinovsky,
Sebastian Stich,
Peter Richtárik
Abstract:
We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($ψ$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $ψ$ in each iteration. In this work we are…
▽ More
We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($ψ$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $ψ$ in each iteration. In this work we are specifically interested in the regime in which the evaluation of prox is costly relative to the evaluation of the gradient, which is the case in many applications. ProxSkip allows for the expensive prox operator to be skipped in most iterations: while its iteration complexity is $\mathcal{O}\left(κ\log \frac{1}{\varepsilon}\right)$, where $κ$ is the condition number of $f$, the number of prox evaluations is $\mathcal{O}\left(\sqrtκ \log \frac{1}{\varepsilon}\right)$ only. Our main motivation comes from federated learning, where evaluation of the gradient operator corresponds to taking a local GD step independently on all devices, and evaluation of prox corresponds to (expensive) communication in the form of gradient averaging. In this context, ProxSkip offers an effective acceleration of communication complexity. Unlike other local gradient-type methods, such as FedAvg, SCAFFOLD, S-Local-GD and FedLin, whose theoretical communication complexity is worse than, or at best matching, that of vanilla GD in the heterogeneous data regime, we obtain a provable and large improvement without any heterogeneity-bounding assumptions.
△ Less
Submitted 24 March, 2023; v1 submitted 18 February, 2022;
originally announced February 2022.
-
Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization
Authors:
Grigory Malinovsky,
Konstantin Mishchenko,
Peter Richtárik
Abstract:
We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reas…
▽ More
We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reason we can show improved complexities. In particular, we prove that whenever the local stepsizes are small, and the update direction is given by FedAvg in conjunction with Random Reshuffling over all clients, one can take a big leap in the obtained direction and improve rates for convex, strongly convex, and non-convex objectives. In particular, in non-convex regime we get an enhancement of the rate of convergence from $\mathcal{O}\left(\varepsilon^{-3}\right)$ to $\mathcal{O}\left(\varepsilon^{-2}\right)$. This result is new even for Random Reshuffling performed on a single node. In contrast, if the local stepsizes are large, we prove that the noise of client sampling can be controlled by using a small server-side stepsize. To the best of our knowledge, this is the first time that local steps provably help to overcome the communication bottleneck. Together, our results on the advantage of large and small server-side stepsizes give a formal justification for the practice of adaptive server-side optimization in federated learning. Moreover, we consider a variant of our algorithm that supports partial client participation, which makes the method more practical.
△ Less
Submitted 26 January, 2022;
originally announced January 2022.
-
Averaged Heavy-Ball Method
Authors:
Marina Danilova,
Grigory Malinovsky
Abstract:
Heavy-Ball method (HB) is known for its simplicity in implementation and practical efficiency. However, as with other momentum methods, it has non-monotone behavior, and for optimal parameters, the method suffers from the so-called peak effect. To address this issue, in this paper, we consider an averaged version of Heavy-Ball method (AHB). We show that for quadratic problems AHB has a smaller max…
▽ More
Heavy-Ball method (HB) is known for its simplicity in implementation and practical efficiency. However, as with other momentum methods, it has non-monotone behavior, and for optimal parameters, the method suffers from the so-called peak effect. To address this issue, in this paper, we consider an averaged version of Heavy-Ball method (AHB). We show that for quadratic problems AHB has a smaller maximal deviation from the solution than HB. Moreover, for general convex and strongly convex functions, we prove non-accelerated rates of global convergence of AHB and its weighted version. We conduct several numerical experiments on minimizing quadratic and non-quadratic functions to demonstrate the advantages of using averaging for HB.
△ Less
Submitted 9 November, 2021;
originally announced November 2021.
-
Random Reshuffling with Variance Reduction: New Analysis and Better Rates
Authors:
Grigory Malinovsky,
Alibek Sailanbayev,
Peter Richtárik
Abstract:
Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. One of the tricks that works so well in practice that it is used as default in virtually all widely used machine learning software is {\em random reshuffling (RR)}. However, the practical bene…
▽ More
Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. One of the tricks that works so well in practice that it is used as default in virtually all widely used machine learning software is {\em random reshuffling (RR)}. However, the practical benefits of RR have until very recently been eluding attempts at being satisfactorily explained using theory. Motivated by recent development due to Mishchenko, Khaled and Richtárik (2020), in this work we provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems. First, we show that RR-SVRG converges linearly with the rate $\mathcal{O}(κ^{3/2})$ in the strongly-convex case, and can be improved further to $\mathcal{O}(κ)$ in the big data regime (when $n > \mathcal{O}(κ)$), where $κ$ is the condition number. This improves upon the previous best rate $\mathcal{O}(κ^2)$ known for a variance reduced RR method in the strongly-convex case due to Ying, Yuan and Sayed (2020). Second, we obtain the first sublinear rate for general convex problems. Third, we establish similar fast rates for Cyclic-SVRG and Shuffle-Once-SVRG. Finally, we develop and analyze a more general variance reduction scheme for RR, which allows for less frequent updates of the control variate. We corroborate our theoretical results with suitably chosen experiments on synthetic and real datasets.
△ Less
Submitted 19 April, 2021;
originally announced April 2021.
-
Distributed Proximal Splitting Algorithms with Rates and Acceleration
Authors:
Laurent Condat,
Grigory Malinovsky,
Peter Richtárik
Abstract:
We analyze several generic proximal splitting algorithms well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new rates on the function value suboptimality or distance to the solution, as well as new accelerated versions, using varying stepsizes. In addition, we propose distributed variants of these algorithms, which can be accelerated…
▽ More
We analyze several generic proximal splitting algorithms well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new rates on the function value suboptimality or distance to the solution, as well as new accelerated versions, using varying stepsizes. In addition, we propose distributed variants of these algorithms, which can be accelerated as well. While most existing results are ergodic, our nonergodic results significantly broaden our understanding of primal-dual optimization algorithms.
△ Less
Submitted 27 January, 2022; v1 submitted 2 October, 2020;
originally announced October 2020.
-
From Local SGD to Local Fixed-Point Methods for Federated Learning
Authors:
Grigory Malinovsky,
Dmitry Kovalev,
Elnur Gasanov,
Laurent Condat,
Peter Richtárik
Abstract:
Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computatio…
▽ More
Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computations done locally on a mobile device. We investigate two strategies to achieve such a consensus: one based on a fixed number of local steps, and the other based on randomized computations. In both cases, the goal is to limit communication of the locally-computed variables, which is often the bottleneck in distributed frameworks. We perform convergence analysis of both methods and conduct a number of experiments highlighting the benefits of our approach.
△ Less
Submitted 16 June, 2020; v1 submitted 3 April, 2020;
originally announced April 2020.