Showing 1–20 of 20 results for author: Malinovsky, G

  1. arXiv:2405.15593

    cs.LG math.NA

    MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

    Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtárik, Dan Alistarh

    Abstract: We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MicroAdam that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instanc…

    Submitted 24 May, 2024; originally announced May 2024.

  2. arXiv:2403.06677

    cs.LG cs.AI cs.DC

    Streamlining in the Riemannian Realm: Efficient Riemannian Optimization with Loopless Variance Reduction

    Authors: Yury Demidovich, Grigory Malinovsky, Peter Richtárik

    Abstract: In this study, we investigate stochastic optimization on Riemannian manifolds, focusing on the crucial variance reduction mechanism used in both Euclidean and Riemannian settings. Riemannian variance-reduced methods usually involve a double-loop structure, computing a full gradient at the start of each loop. Determining the optimal inner loop length is challenging in practice, as it depends on str…

    Submitted 11 March, 2024; originally announced March 2024.

  3. arXiv:2311.16086

    cs.LG cs.AI cs.DC math.OC

    MAST: Model-Agnostic Sparsified Training

    Authors: Yury Demidovich, Grigory Malinovsky, Egor Shulgin, Peter Richtárik

    Abstract: We introduce a novel optimization problem formulation that departs from the conventional way of minimizing machine learning model loss as a black-box function. Unlike traditional formulations, the proposed approach explicitly incorporates an initially pre-trained model and random sketch operators, allowing for sparsification of both the model and gradient during training. We establish insightful p…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: 58 pages, 5 figures

  4. arXiv:2311.14127

    cs.LG cs.AI cs.DC math.OC

    Byzantine Robustness and Partial Participation Can Be Achieved at Once: Just Clip Gradient Differences

    Authors: Grigory Malinovsky, Peter Richtárik, Samuel Horváth, Eduard Gorbunov

    Abstract: Distributed learning has emerged as a leading paradigm for training large machine learning models. However, in real-world scenarios, participants may be unreliable or malicious, posing a significant challenge to the integrity and accuracy of the trained models. Byzantine fault tolerance mechanisms have been proposed to address these issues, but they often assume full participation from all clients…

    Submitted 7 June, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

    Comments: 52 pages; 4 figures. Changes in v2: a heuristic extension of the proposed method, new numerical results, a simpler presentation of the main results, and corrections of small typos

  5. arXiv:2306.03240

    cs.LG

    Improving Accelerated Federated Learning with Compression and Importance Sampling

    Authors: Michał Grudzień, Grigory Malinovsky, Peter Richtárik

    Abstract: Federated Learning is a collaborative training framework that leverages heterogeneous data distributed across a vast number of clients. Since it is practically infeasible to request and process all clients during the aggregation step, partial participation must be supported. In this setting, the communication between the server and clients poses a major bottleneck. To reduce communication loads, t…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: 33 pages, 3 algorithms, 1 figure

  6. arXiv:2305.16296

    cs.LG

    A Guide Through the Zoo of Biased SGD

    Authors: Yury Demidovich, Grigory Malinovsky, Igor Sokolov, Peter Richtárik

    Abstract: Stochastic Gradient Descent (SGD) is arguably the most important single algorithm in modern machine learning. Although SGD with unbiased gradient estimators has been studied extensively over at least half a century, SGD variants relying on biased estimators are rare. Nevertheless, there has been an increased interest in this topic in recent years. However, existing literature on SGD with biased es…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: 55 pages, 2 figures, 10 tables

  7. arXiv:2302.09832

    cs.LG math.OC

    TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation

    Authors: Laurent Condat, Ivan Agarský, Grigory Malinovsky, Peter Richtárik

    Abstract: In distributed optimization and learning, several machines alternate between local computations in parallel and communication with a distant server. Communication is usually slow and costly and forms the main bottleneck. This is particularly true in federated learning, where a large number of users collaborate toward a global training task. In addition, it is desirable for a robust algorithm to al…

    Submitted 27 April, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

    Comments: This work is a follow-up of our previous work introducing CompressedScaffnew in paper arXiv:2210.13277

  8. arXiv:2302.03662

    cs.LG

    Federated Learning with Regularized Client Participation

    Authors: Grigory Malinovsky, Samuel Horváth, Konstantin Burlachenko, Peter Richtárik

    Abstract: Federated Learning (FL) is a distributed machine learning approach where multiple clients work together to solve a machine learning task. One of the key challenges in FL is the issue of partial participation, which occurs when a large number of clients are involved in the training process. The traditional method to address this problem is randomly selecting a subset of clients at each communicatio…

    Submitted 28 February, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: 33 pages, 10 figures, 1 algorithm, 3 theorems

  9. arXiv:2212.14439

    math.OC cs.LG

    An Optimal Algorithm for Strongly Convex Min-min Optimization

    Authors: Alexander Gasnikov, Dmitry Kovalev, Grigory Malinovsky

    Abstract: In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. The existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{\kappa_x,\kappa_y\}} \log 1/\epsilon)$ of computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $\kappa_x$ and $\kappa_y$ are condition numbers with respect to variable blocks $x$ and $y$. We propose a new algorithm that only requires…

    Submitted 8 February, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

    Comments: 12 pages, 2 figures, 1 algorithm

  10. arXiv:2212.14370

    cs.LG cs.DC math.OC

    Can 5th Generation Local Training Methods Support Client Sampling? Yes!

    Authors: Michał Grudzień, Grigory Malinovsky, Peter Richtárik

    Abstract: The celebrated FedAvg algorithm of McMahan et al. (2017) is based on three components: client sampling (CS), data sampling (DS) and local training (LT). While the first two are reasonably well understood, the third component, whose role is to reduce the number of communication rounds needed to train the model, resisted all attempts at a satisfactory theoretical explanation. Malinovsky et al. (2022…

    Submitted 6 January, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

    Comments: 38 pages, 2 figures, 5 algorithms, 2 tables

  11. arXiv:2209.07883

    math.OC cs.LG

    Minibatch Stochastic Three Points Method for Unconstrained Smooth Minimization

    Authors: Soumia Boucherouite, Grigory Malinovsky, Peter Richtárik, El Houcine Bergou

    Abstract: In this paper, we propose a new zero-order optimization method called minibatch stochastic three points (MiSTP) method to solve an unconstrained minimization problem in a setting where only an approximation of the objective function evaluation is possible. It is based on the recently proposed stochastic three points (STP) method (Bergou et al., 2020). At each iteration, MiSTP generates a random se…

    Submitted 16 September, 2022; originally announced September 2022.

  12. arXiv:2207.04338

    cs.LG cs.DC math.OC

    Variance Reduced ProxSkip: Algorithm, Theory and Application to Federated Learning

    Authors: Grigory Malinovsky, Kai Yi, Peter Richtárik

    Abstract: We study distributed optimization methods based on the local training (LT) paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we identify 5 generations of LT methods: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5…

    Submitted 9 July, 2022; originally announced July 2022.

    Comments: 38 pages, 2 algorithms, 4 theorems, 11 figures

  13. arXiv:2206.07021

    cs.LG math.OC

    Federated Optimization Algorithms with Random Reshuffling and Gradient Compression

    Authors: Abdurakhmon Sadiev, Grigory Malinovsky, Eduard Gorbunov, Igor Sokolov, Ahmed Khaled, Konstantin Burlachenko, Peter Richtárik

    Abstract: Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well-known in practice and recently confirmed in theory that stochastic methods based on without-replacement sampling,…

    Submitted 3 November, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: 66 pages, 6 figures. Changes in V2: the presentation of the results was changed, extra experiments were added. Code: https://github.com/IgorSokoloff/rr_with_compression_experiments_source_code

  14. arXiv:2205.03914

    cs.LG math.OC

    Federated Random Reshuffling with Compression and Variance Reduction

    Authors: Grigory Malinovsky, Peter Richtárik

    Abstract: Random Reshuffling (RR), which is a variant of Stochastic Gradient Descent (SGD) employing sampling without replacement, is an immensely popular method for training supervised machine learning models via empirical risk minimization. Due to its superior practical performance, it is embedded and often set as default in standard machine learning software. Under the name FedRR, this method was recentl…

    Submitted 10 May, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

    Comments: 18 pages, 1 table, 1 figure, 3 algorithms

  15. arXiv:2202.09357

    cs.LG math.OC

    ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!

    Authors: Konstantin Mishchenko, Grigory Malinovsky, Sebastian Stich, Peter Richtárik

    Abstract: We introduce ProxSkip -- a surprisingly simple and provably efficient method for minimizing the sum of a smooth ($f$) and an expensive nonsmooth proximable ($\psi$) function. The canonical approach to solving such problems is via the proximal gradient descent (ProxGD) algorithm, which is based on the evaluation of the gradient of $f$ and the prox operator of $\psi$ in each iteration. In this work we are…

    Submitted 24 March, 2023; v1 submitted 18 February, 2022; originally announced February 2022.

    Comments: 20 pages, 5 algorithms, 2 figures

  16. arXiv:2201.11066

    cs.LG math.OC

    Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization

    Authors: Grigory Malinovsky, Konstantin Mishchenko, Peter Richtárik

    Abstract: We present a theoretical study of server-side optimization in federated learning. Our results are the first to show that the widely popular heuristic of scaling the client updates with an extra parameter is very useful in the context of Federated Averaging (FedAvg) with local passes over the client data. Each local pass is performed without replacement using Random Reshuffling, which is a key reas…

    Submitted 26 January, 2022; originally announced January 2022.

    Comments: 26 pages, 3 tables, 2 figures, 1 algorithm, conference paper

  17. arXiv:2111.05430

    math.OC

    Averaged Heavy-Ball Method

    Authors: Marina Danilova, Grigory Malinovsky

    Abstract: Heavy-Ball method (HB) is known for its simplicity in implementation and practical efficiency. However, as with other momentum methods, it has non-monotone behavior, and for optimal parameters, the method suffers from the so-called peak effect. To address this issue, in this paper, we consider an averaged version of Heavy-Ball method (AHB). We show that for quadratic problems AHB has a smaller max…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: 30 pages, 6 figures

  18. arXiv:2104.09342

    cs.LG cs.AI math.OC

    Random Reshuffling with Variance Reduction: New Analysis and Better Rates

    Authors: Grigory Malinovsky, Alibek Sailanbayev, Peter Richtárik

    Abstract: Virtually all state-of-the-art methods for training supervised machine learning models are variants of SGD enhanced with a number of additional tricks, such as minibatching, momentum, and adaptive stepsizes. One of the tricks that works so well in practice that it is used as default in virtually all widely used machine learning software is random reshuffling (RR). However, the practical bene…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: 24 pages, 5 figures, 4 algorithms, 1 table

  19. arXiv:2010.00952

    math.OC cs.LG math.NA

    Distributed Proximal Splitting Algorithms with Rates and Acceleration

    Authors: Laurent Condat, Grigory Malinovsky, Peter Richtárik

    Abstract: We analyze several generic proximal splitting algorithms well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new rates on the function value suboptimality or distance to the solution, as well as new accelerated versions, using varying stepsizes. In addition, we propose distributed variants of these algorithms, which can be accelerated…

    Submitted 27 January, 2022; v1 submitted 2 October, 2020; originally announced October 2020.

  20. arXiv:2004.01442

    cs.LG math.OC stat.ML

    From Local SGD to Local Fixed-Point Methods for Federated Learning

    Authors: Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, Peter Richtárik

    Abstract: Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computatio…

    Submitted 16 June, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: Accepted to ICML 2020