Showing 1–50 of 71 results for author: Richtárik, P

  1. arXiv:2405.15545  [pdf, other]

    math.OC cs.LG stat.ML

    Freya PAGE: First Optimal Time Complexity for Large-Scale Nonconvex Finite-Sum Optimization with Heterogeneous Asynchronous Computations

    Authors: Alexander Tyurin, Kaja Gruntkowska, Peter Richtárik

    Abstract: In practical distributed systems, workers are typically not homogeneous, and due to differences in hardware configurations and network conditions, can have highly varying processing times. We consider smooth nonconvex finite-sum (empirical risk minimization) problems in this setup and introduce a new parallel method, Freya PAGE, designed to handle arbitrarily heterogeneous and asynchronous computa…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: 43 pages, 2 figures

  2. arXiv:2402.10774  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Error Feedback Reloaded: From Quadratic to Arithmetic Mean of Smoothness Constants

    Authors: Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko

    Abstract: Error Feedback (EF) is a highly popular and immensely effective mechanism for fixing convergence issues which arise in distributed training methods (such as distributed GD or SGD) when these are enhanced with greedy communication compression techniques such as TopK. While EF was proposed almost a decade ago (Seide et al., 2014), and despite concentrated effort by the community to advance the theor…

    Submitted 16 February, 2024; originally announced February 2024.

    Comments: 70 pages, 14 figures, 6 tables

    MSC Class: 90C26; 74Pxx ACM Class: G.1.6; I.2.11; I.2.m

  3. arXiv:2402.06412  [pdf, other]

    math.OC cs.LG stat.ML

    Improving the Worst-Case Bidirectional Communication Complexity for Nonconvex Distributed Optimization under Function Similarity

    Authors: Kaja Gruntkowska, Alexander Tyurin, Peter Richtárik

    Abstract: Effective communication between the server and workers plays a key role in distributed optimization. In this paper, we focus on optimizing the server-to-worker communication, uncovering inefficiencies in prevalent downlink compression approaches. Considering first the pure setup where the uplink communication costs are negligible, we introduce MARINA-P, a novel method for downlink compression, emp…

    Submitted 9 February, 2024; originally announced February 2024.

  4. arXiv:2305.18929  [pdf, other]

    cs.LG math.OC stat.ML

    Clip21: Error Feedback for Gradient Clipping

    Authors: Sarit Khirirat, Eduard Gorbunov, Samuel Horváth, Rustem Islamov, Fakhri Karray, Peter Richtárik

    Abstract: Motivated by the increasing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with gradient clipping, i.e., clipping applied to the gradients computed from local information at the nodes. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods [1], it also induces…

    Submitted 30 May, 2023; originally announced May 2023.

  5. arXiv:2305.18627  [pdf, other]

    cs.LG cs.DC stat.ML

    Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees

    Authors: Jihao Xin, Marco Canini, Peter Richtárik, Samuel Horváth

    Abstract: Efficient distributed training is a principal driver of recent advances in deep learning. However, communication often proves costly and becomes the primary bottleneck in these systems. As a result, there is a demand for the design of efficient communication mechanisms that can empirically boost throughput while providing theoretical guarantees. In this work, we introduce Global-QSGD, a novel fami…

    Submitted 29 May, 2023; originally announced May 2023.

  6. arXiv:2305.15264  [pdf, other]

    math.OC cs.DC cs.LG stat.ML

    Error Feedback Shines when Features are Rare

    Authors: Peter Richtárik, Elnur Gasanov, Konstantin Burlachenko

    Abstract: We provide the first proof that gradient descent $\left({\color{green}\sf GD}\right)$ with greedy sparsification $\left({\color{green}\sf TopK}\right)$ and error feedback $\left({\color{green}\sf EF}\right)$ can obtain better communication complexity than vanilla ${\color{green}\sf GD}$ when solving the distributed optimization problem…

    Submitted 24 May, 2023; originally announced May 2023.

  7. arXiv:2303.04622  [pdf, ps, other]

    stat.ML cs.LG math.OC

    ELF: Federated Langevin Algorithms with Primal, Dual and Bidirectional Compression

    Authors: Avetik Karagulyan, Peter Richtárik

    Abstract: Federated sampling algorithms have recently gained great popularity in the community of machine learning and statistics. This paper studies variants of such algorithms called Error Feedback Langevin algorithms (ELF). In particular, we analyze the combinations of EF21 and EF21-P with the federated Langevin Monte-Carlo. We propose three algorithms: P-ELF, D-ELF, and B-ELF that use, respectively, pri…

    Submitted 8 March, 2023; originally announced March 2023.

  8. arXiv:2206.02275  [pdf, other]

    cs.LG math.OC stat.ML

    Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling

    Authors: Alexander Tyurin, Lukang Sun, Konstantin Burlachenko, Peter Richtárik

    Abstract: We revisit the classical problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions. The optimal complexity of stochastic first-order methods in terms of the number of gradient evaluations of individual functions is $\mathcal{O}\left(n + n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods $\small\sf\color{green}{SPIDER}$(arXi…

    Submitted 5 June, 2022; originally announced June 2022.

    Comments: 25 pages, 6 figures

    MSC Class: 90C26; 65K05 ACM Class: F.2.1; I.2.6

  9. arXiv:2204.13169  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    FedShuffle: Recipes for Better Use of Local Work in Federated Learning

    Authors: Samuel Horváth, Maziar Sanjabi, Lin Xiao, Peter Richtárik, Michael Rabbat

    Abstract: The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). Such methods are usually implemented by having clients perform one or more epochs of local training per round while randomly reshuffling their finite dataset in each epoch. Data imbalance, wher…

    Submitted 27 September, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

    Comments: Published in Transactions on Machine Learning Research (09/2022)

  10. arXiv:2201.13320  [pdf, other]

    cs.LG cs.DC cs.DS math.OC stat.ML

    BEER: Fast $O(1/T)$ Rate for Decentralized Nonconvex Optimization with Communication Compression

    Authors: Haoyu Zhao, Boyue Li, Zhize Li, Peter Richtárik, Yuejie Chi

    Abstract: Communication efficiency has been widely recognized as the bottleneck for large-scale decentralized machine learning applications in multi-agent or federated environments. To tackle the communication bottleneck, there have been many efforts to design communication-compressed algorithms for decentralized nonconvex optimization, where the clients are only allowed to communicate a small amount of qua…

    Submitted 13 October, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

    Comments: NeurIPS 2022

  11. arXiv:2111.11556  [pdf, other]

    cs.LG math.OC stat.ML

    FLIX: A Simple and Communication-Efficient Alternative to Local Methods in Federated Learning

    Authors: Elnur Gasanov, Ahmed Khaled, Samuel Horváth, Peter Richtárik

    Abstract: Federated Learning (FL) is an increasingly popular machine learning paradigm in which multiple nodes try to collaboratively learn under privacy, communication and multiple heterogeneity constraints. A persistent problem in federated learning is that it is not clear what the optimization objective should be: the standard average risk minimization of supervised learning is inadequate in handling sev…

    Submitted 23 February, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

    Comments: V2: includes non-convex analysis as well as new large-scale experiments with neural networks. To appear in AISTATS 2022

  12. arXiv:2110.03313  [pdf, other]

    cs.LG stat.ML

    Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees

    Authors: Aleksandr Beznosikov, Peter Richtárik, Michael Diskin, Max Ryabinin, Alexander Gasnikov

    Abstract: Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across various applications, we need to rely on parallel and distributed computing. However, in distributed tr…

    Submitted 2 April, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: Appears in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022). Minor modifications with respect to the NeurIPS version. 73 pages, 9 algorithms, 2 figures, 2 tables

    Journal ref: https://proceedings.neurips.cc/paper_files/paper/2022/hash/5ac1428c23b5da5e66d029646ea3206d-Abstract-Conference.html

  13. arXiv:2110.03300  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Permutation Compressors for Provably Faster Distributed Nonconvex Optimization

    Authors: Rafał Szlendak, Alexander Tyurin, Peter Richtárik

    Abstract: We study the MARINA method of Gorbunov et al (2021) -- the current state-of-the-art distributed non-convex optimization method in terms of theoretical communication complexity. Theoretical superiority of this method can be largely attributed to two sources: the use of a carefully engineered biased stochastic gradient estimator, which leads to a reduction in the number of communication rounds, and…

    Submitted 7 October, 2021; originally announced October 2021.

    Comments: 53 pages

  14. arXiv:2106.05203  [pdf, other]

    cs.LG math.OC stat.ML

    EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback

    Authors: Peter Richtárik, Igor Sokolov, Ilyas Fatkhullin

    Abstract: Error feedback (EF), also known as error compensation, is an immensely popular convergence stabilization mechanism in the context of distributed training of supervised machine learning models enhanced by the use of contractive communication compression mechanisms, such as Top-$k$. First proposed by Seide et al (2014) as a heuristic, EF resisted any theoretical understanding until recently [Stich e…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 37 pages, 5 algorithms, 3 Theorems, 8 Lemmas, 15 Figures

  15. arXiv:2102.12810  [pdf, other]

    cs.LG stat.ML

    Hyperparameter Transfer Learning with Adaptive Complexity

    Authors: Samuel Horváth, Aaron Klein, Peter Richtárik, Cédric Archambeau

    Abstract: Bayesian optimization (BO) is a sample efficient approach to automatically tune the hyperparameters of machine learning models. In practice, one frequently has to solve similar hyperparameter tuning problems sequentially. For example, one might have to tune a type of neural network learned across a series of different classification problems. Recent work on multi-task BO exploits knowledge gained…

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: 12 pages, Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA

  16. arXiv:2102.08374  [pdf, other]

    cs.LG math.OC stat.ML

    IntSGD: Adaptive Floatless Compression of Stochastic Gradients

    Authors: Konstantin Mishchenko, Bokun Wang, Dmitry Kovalev, Peter Richtárik

    Abstract: We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors with a number known to every device and then rounding to integers. In contrast to the prior work on integer compression for SwitchML by Sapio et al. (2021), our IntSGD method is provably conver…

    Submitted 20 March, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

    Comments: Spotlight at ICLR 2022. 27 pages, 6 figures, 3 algorithms

    Journal ref: International Conference on Learning Representations (2022)

  17. arXiv:2010.00892  [pdf, other]

    cs.LG math.OC stat.ML

    Variance-Reduced Methods for Machine Learning

    Authors: Robert M. Gower, Mark Schmidt, Francis Bach, Peter Richtarik

    Abstract: Stochastic optimization lies at the heart of machine learning, and its cornerstone is stochastic gradient descent (SGD), a method introduced over 60 years ago. The last 8 years have seen an exciting new development: variance reduction (VR) for stochastic optimization methods. These VR methods excel in settings where more than one pass through the training data is allowed, achieving a faster conver…

    Submitted 2 October, 2020; originally announced October 2020.

    Comments: 16 pages, 7 figures, 1 table

    MSC Class: 65K05; 68T99 ACM Class: G.1.6

  18. arXiv:2008.10898  [pdf, other]

    cs.LG cs.AI cs.DS math.OC stat.ML

    PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

    Authors: Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik

    Abstract: In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, w…

    Submitted 11 June, 2021; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: 25 pages; accepted by ICML 2021 (long talk)

  19. arXiv:2006.11773  [pdf, other]

    math.OC stat.ML

    Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization

    Authors: Dmitry Kovalev, Adil Salim, Peter Richtárik

    Abstract: We consider the task of decentralized minimization of the sum of smooth strongly convex functions stored across the nodes of a network. For this problem, lower bounds on the number of gradient computations and the number of communication rounds required to achieve $\varepsilon$ accuracy have recently been proven. We propose two new algorithms for this decentralized optimization problem and equip t…

    Submitted 13 November, 2020; v1 submitted 21 June, 2020; originally announced June 2020.

  20. arXiv:2006.11573  [pdf, other]

    cs.LG math.OC stat.ML

    Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

    Authors: Ahmed Khaled, Othmane Sebbouh, Nicolas Loizou, Robert M. Gower, Peter Richtárik

    Abstract: We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov, Hanzely \& Richtárik (2020) and dropping the requirement that the loss function be strongly convex. Instead, we only rely on convexity of the loss function. Our unified analysis appli…

    Submitted 20 June, 2020; originally announced June 2020.

  21. arXiv:2006.11077  [pdf, other]

    cs.LG stat.ML

    A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

    Authors: Samuel Horváth, Peter Richtárik

    Abstract: Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed com…

    Submitted 14 March, 2021; v1 submitted 19 June, 2020; originally announced June 2020.

    Comments: 10 pages, 7 figures, published as a conference paper at ICLR 2021

  22. arXiv:2006.09270  [pdf, other]

    stat.ML cs.LG math.OC

    Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

    Authors: Adil Salim, Peter Richtárik

    Abstract: We consider the task of sampling with respect to a log concave probability distribution. The potential of the target distribution is assumed to be composite, \textit{i.e.}, written as the sum of a smooth convex term, and a nonsmooth convex term possibly taking infinite values. The target distribution can be seen as a minimizer of the Kullback-Leibler divergence defined on the Wasserstein space (\t…

    Submitted 22 February, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

  23. arXiv:2006.07013  [pdf, ps, other]

    math.OC cs.DS cs.LG stat.ML

    A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization

    Authors: Zhize Li, Peter Richtárik

    Abstract: In this paper, we study the performance of a large family of SGD variants in the smooth nonconvex regime. To this end, we propose a generic and flexible assumption capable of accurate modeling of the second moment of the stochastic gradient. Our assumption is satisfied by a large number of specific variants of SGD in the literature, including SGD with arbitrary sampling, SGD with compressed gradie…

    Submitted 12 June, 2020; originally announced June 2020.

    Comments: 77 pages

  24. arXiv:2006.05988  [pdf, other]

    math.OC cs.LG stat.ML

    Random Reshuffling: Simple Analysis with Vast Improvements

    Authors: Konstantin Mishchenko, Ahmed Khaled, Peter Richtárik

    Abstract: Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling. Often contrasted with its sibling Stochastic Gradient Descent (SGD), RR is usually faster in practice and enjoys significant popularity in convex and non-convex optimization. The convergence rate of RR has attracted substantial attention r…

    Submitted 5 April, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: v3 updates: Theorem 4 includes a new result for Polyak-Lojasiewicz functions. NeurIPS 2020. 35 pages, 2 figures, 2 tables, 3 algorithms

  25. arXiv:2005.01097  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptive Learning of the Optimal Batch Size of SGD

    Authors: Motasem Alfarra, Slavomir Hanzely, Alyazeed Albasyoni, Bernard Ghanem, Peter Richtarik

    Abstract: Recent advances in the theoretical understanding of SGD led to a formula for the optimal batch size minimizing the number of effective data passes, i.e., the number of iterations times the batch size. However, this formula is of no practical value as it depends on the knowledge of the variance of the stochastic gradients evaluated at the optimum. In this paper we design a practical SGD method capa…

    Submitted 19 November, 2021; v1 submitted 3 May, 2020; originally announced May 2020.

    Comments: Accepted to the 12th Annual Workshop on Optimization for Machine Learning (OPT2020)

  26. arXiv:2004.02635  [pdf, other]

    math.OC cs.LG stat.ML

    Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

    Authors: Adil Salim, Laurent Condat, Konstantin Mishchenko, Peter Richtárik

    Abstract: We consider minimizing the sum of three convex functions, where the first one F is smooth, the second one is nonsmooth and proximable and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal-dual algorithm, which we call PDDY,…

    Submitted 26 July, 2022; v1 submitted 3 April, 2020; originally announced April 2020.

  27. arXiv:2004.01442  [pdf, other]

    cs.LG math.OC stat.ML

    From Local SGD to Local Fixed-Point Methods for Federated Learning

    Authors: Grigory Malinovsky, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, Peter Richtárik

    Abstract: Most algorithms for solving optimization problems or finding saddle points of convex-concave functions are fixed-point algorithms. In this work we consider the generic problem of finding a fixed point of an average of operators, or an approximation thereof, in a distributed setting. Our work is motivated by the needs of federated learning. In this context, each local operator models the computatio…

    Submitted 16 June, 2020; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: Accepted to ICML 2020

  28. arXiv:2002.12410  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    On Biased Compression for Distributed Learning

    Authors: Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, Mher Safaryan

    Abstract: In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study…

    Submitted 14 January, 2024; v1 submitted 27 February, 2020; originally announced February 2020.

    Comments: 50 pages, 9 figures, 5 tables, 22 theorems and lemmas, 7 new compression operators, 1 algorithm

    Journal ref: Journal of Machine Learning Research 2023: https://www.jmlr.org/papers/v24/21-1548.html

  29. arXiv:2002.08958  [pdf, other]

    cs.LG cs.DC cs.IT math.OC stat.ML

    Uncertainty Principle for Communication Compression in Distributed and Federated Learning and the Search for an Optimal Compressor

    Authors: Mher Safaryan, Egor Shulgin, Peter Richtárik

    Abstract: In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion…

    Submitted 26 January, 2021; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: 23 pages, 6 figures, 2 tables

    Journal ref: Information and Inference: A Journal of the IMA, 2021

  30. arXiv:2002.05516  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Federated Learning of a Mixture of Global and Local Models

    Authors: Filip Hanzely, Peter Richtárik

    Abstract: We propose a new optimization formulation for training federated learning models. The standard formulation has the form of an empirical risk minimization problem constructed to find a single global model trained from the private data stored across all participating devices. In contrast, our formulation seeks an explicit trade-off between this traditional global model and the local models, which ca…

    Submitted 12 February, 2021; v1 submitted 10 February, 2020; originally announced February 2020.

    Comments: 40 pages, 8 algorithms, 6 figures, 1 table (minor changes compared to the previous versions)

  31. arXiv:2002.05359  [pdf, other]

    cs.LG math.OC stat.ML

    Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization

    Authors: Samuel Horváth, Lihua Lei, Peter Richtárik, Michael I. Jordan

    Abstract: Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results,…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: 11 pages, 4 Figures, 20 pages Appendix

  32. arXiv:2002.03329  [pdf, other]

    math.OC cs.LG stat.ML

    Better Theory for SGD in the Nonconvex World

    Authors: Ahmed Khaled, Peter Richtárik

    Abstract: Large-scale nonconvex optimization problems are ubiquitous in modern machine learning, and among practitioners interested in solving them, Stochastic Gradient Descent (SGD) reigns supreme. We revisit the analysis of SGD in the nonconvex setting and propose a new variant of the recently introduced expected smoothness assumption which governs the behaviour of the second moment of the stochastic grad…

    Submitted 24 July, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

    Comments: 33 pages, 3 figures, 4 theorems, and 4 propositions. V3 updates: added several references on error conditions (Tseng, Solodov, Bottou and Tsitsiklis, Grimmer), added a full proof of Corollary 1, cleaned up several proofs, and made minor adjustments to text for clarity

  33. arXiv:1912.01597  [pdf, other]

    cs.LG math.OC stat.ML

    Stochastic Newton and Cubic Newton Methods with Simple Local Linear-Quadratic Rates

    Authors: Dmitry Kovalev, Konstantin Mishchenko, Peter Richtárik

    Abstract: We present two new remarkably simple stochastic second-order methods for minimizing the average of a very large number of sufficiently smooth and strongly convex functions. The first is a stochastic variant of Newton's method (SN), and the second is a stochastic variant of cubically regularized Newton's method (SCN). We establish local linear-quadratic convergence results. Unlike existing stochast…

    Submitted 3 December, 2019; originally announced December 2019.

    Comments: 16 pages, 2 figures, 3 algorithms, 2 theorems, 7 lemmas; to be presented at the NeurIPS workshop "Beyond First Order Methods in ML"

  34. arXiv:1909.04746  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    Tighter Theory for Local SGD on Identical and Heterogeneous Data

    Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

    Abstract: We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. T…

    Submitted 14 April, 2022; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: AISTATS 2020. 31 pages, 1 algorithm, 5 theorems, 6 figures

  35. arXiv:1909.04716  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    Gradient Descent with Compressed Iterates

    Authors: Ahmed Khaled, Peter Richtárik

    Abstract: We propose and analyze a new type of stochastic first order method: gradient descent with compressed iterates (GDCI). GDCI in each iteration first compresses the current iterate using a lossy randomized compression technique, and subsequently takes a gradient step. This method is a distillation of a key ingredient in the current practice of federated learning, where a model needs to be compressed…

    Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 10 pages, 1 algorithm, 1 theorem, 5 lemmas

  36. arXiv:1909.04715  [pdf, other]

    cs.LG cs.DC math.NA math.OC stat.ML

    First Analysis of Local GD on Heterogeneous Data

    Authors: Ahmed Khaled, Konstantin Mishchenko, Peter Richtárik

    Abstract: We provide the first convergence analysis of local gradient descent for minimizing the average of smooth and convex but otherwise arbitrary functions. Problems of this form and local gradient descent as a solution method are of importance in federated learning, where each function is based on private data stored by a user on a mobile device, and the data of different users can be arbitrarily heter…

    Submitted 18 March, 2020; v1 submitted 10 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality. 11 pages, 4 lemmas, 1 theorem

  37. arXiv:1909.00145  [pdf, other]

    eess.IV cs.LG stat.ML

    Stochastic Convolutional Sparse Coding

    Authors: Jinhui Xiong, Peter Richtárik, Wolfgang Heidrich

    Abstract: State-of-the-art methods for Convolutional Sparse Coding usually employ Fourier-domain solvers in order to speed up the convolution operators. However, this approach is not without shortcomings. For example, Fourier-domain representations implicitly assume circular boundary conditions and make it hard to fully exploit the sparsity of the problem as well as the small spatial support of the filters.…

    Submitted 31 August, 2019; originally announced September 2019.

    Comments: 8 pages

  38. arXiv:1905.11768  [pdf, other]

    stat.ML cs.LG math.OC math.ST

    Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates

    Authors: Adil Salim, Dmitry Kovalev, Peter Richtárik

    Abstract: We propose a new algorithm---Stochastic Proximal Langevin Algorithm (SPLA)---for sampling from a log concave distribution. Our method is a generalization of the Langevin algorithm to potentials expressed as the sum of one stochastic smooth term and multiple stochastic nonsmooth terms. In each iteration, our splitting technique only requires access to a stochastic gradient of the smooth term and a…

    Submitted 16 June, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

    Journal ref: NeurIPS 2019 (Spotlight)

  39. arXiv:1905.10988  [pdf, other]

    cs.LG math.OC stat.ML

    Natural Compression for Distributed Deep Learning

    Authors: Samuel Horvath, Chen-Yu Ho, Ludovit Horvath, Atal Narayan Sahu, Marco Canini, Peter Richtarik

    Abstract: Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck and various lossy update compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple yet theoretically and practical…

    Submitted 5 September, 2022; v1 submitted 27 May, 2019; originally announced May 2019.

    Comments: Proceedings of 3rd Annual Conference on Mathematical and Scientific Machine Learning (MSML 2022)

  40. arXiv:1903.07971  [pdf, other]

    math.OC cs.LG math.NA stat.ML

    Convergence Analysis of Inexact Randomized Iterative Methods

    Authors: Nicolas Loizou, Peter Richtárik

    Abstract: In this paper we present a convergence rate analysis of inexact variants of several randomized iterative methods. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic subspace ascent. A common feature of these methods is that in their update rule a certain sub-problem needs to be solved exactly. We relax this requirement by allowin…

    Submitted 19 March, 2019; originally announced March 2019.

    Comments: 29 pages, 4 figures, 4 tables

  41. arXiv:1903.06701  [pdf, other]

    cs.DC cs.LG cs.NI stat.ML

    Scaling Distributed Machine Learning with In-Network Aggregation

    Authors: Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

    Abstract: Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design…

    Submitted 30 September, 2020; v1 submitted 22 February, 2019; originally announced March 2019.

  42. arXiv:1901.09997  [pdf, other]

    math.OC cs.LG stat.ML

    Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample

    Authors: Albert S. Berahas, Majid Jahani, Peter Richtárik, Martin Takáč

    Abstract: We present two sampled quasi-Newton methods (sampled LBFGS and sampled LSR1) for solving empirical risk minimization problems that arise in machine learning. Contrary to the classical variants of these methods that sequentially build Hessian or inverse Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to…

    Submitted 27 July, 2021; v1 submitted 28 January, 2019; originally announced January 2019.

    Comments: 50 pages, 33 figures

  43. arXiv:1901.09437  [pdf, other]

    cs.LG math.OC stat.ML

    99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it

    Authors: Konstantin Mishchenko, Filip Hanzely, Peter Richtárik

    Abstract: Many popular distributed optimization methods for training machine learning models fit the following template: a local gradient estimate is computed independently by each worker, then communicated to a master, which subsequently performs averaging. The average is broadcast back to the workers, which use it to perform a gradient-type step to update the local version of the model. It is also well kn…

    Submitted 4 June, 2019; v1 submitted 27 January, 2019; originally announced January 2019.

    Comments: 41 pages, 8 algorithms, 10 theorems, 12 figures

  44. arXiv:1901.09401  [pdf, other]

    cs.LG math.OC stat.ML

    SGD: General Analysis and Improved Rates

    Authors: Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, Peter Richtarik

    Abstract: We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm. Our theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. This is the first time such an analysis is performed, and most of our variants of SGD w…

    Submitted 1 May, 2019; v1 submitted 27 January, 2019; originally announced January 2019.

    Comments: 23 pages, 6 figures

    Journal ref: Proceedings of the 36th International Conference on Machine Learning, PMLR 97:5200-5209, 2019

  45. arXiv:1901.09269  [pdf, other]

    cs.LG math.OC stat.ML

    Distributed Learning with Compressed Gradient Differences

    Authors: Konstantin Mishchenko, Eduard Gorbunov, Martin Takáč, Peter Richtárik

    Abstract: Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirira…

    Submitted 28 December, 2023; v1 submitted 26 January, 2019; originally announced January 2019.

    Comments: 59 pages; Changes in V3: writing, presentation, and numerical experiments

  46. arXiv:1901.08689  [pdf, other]

    cs.LG math.OC stat.ML

    Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop

    Authors: Dmitry Kovalev, Samuel Horvath, Peter Richtarik

    Abstract: The stochastic variance-reduced gradient method (SVRG) and its accelerated variant (Katyusha) have attracted enormous attention in the machine learning community in the last few years due to their superior theoretical properties and empirical behaviour on training supervised machine learning models via the empirical risk minimization paradigm. A key structural element in both of these methods is t…

    Submitted 5 June, 2019; v1 submitted 24 January, 2019; originally announced January 2019.

    Comments: 14 pages, 2 algorithms, 9 lemmas, 2 theorems, 4 figures

  47. arXiv:1901.08669  [pdf, ps, other]

    cs.LG math.OC stat.ML

    SAGA with Arbitrary Sampling

    Authors: Xu Qian, Zheng Qu, Peter Richtárik

    Abstract: We study the problem of minimizing the average of a very large number of smooth functions, which is of key importance in training supervised learning models. One of the most celebrated methods in this context is the SAGA algorithm. Despite years of research on the topic, a general-purpose version of SAGA---one that would include arbitrary importance sampling and minibatching schemes---does not exi…

    Submitted 24 January, 2019; originally announced January 2019.

    Comments: 27 pages, 8 Figures, 1 algorithm

  48. arXiv:1802.03801  [pdf, other]

    math.OC cs.LG stat.ML

    SGD and Hogwild! Convergence Without the Bounded Gradients Assumption

    Authors: Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, Martin Takáč

    Abstract: Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always viol…

    Submitted 8 June, 2018; v1 submitted 11 February, 2018; originally announced February 2018.

    Journal ref: Proceedings of the 35th International Conference on Machine Learning, PMLR 80:3747-3755, 2018

  49. arXiv:1801.05661  [pdf, other]

    stat.CO

    A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments

    Authors: Radoslav Harman, Lenka Filová, Peter Richtárik

    Abstract: We propose a class of subspace ascent methods for computing optimal approximate designs that covers both existing as well as new and more efficient algorithms. Within this class of methods, we construct a simple, randomized exchange algorithm (REX). Numerical comparisons suggest that the performance of REX is comparable or superior to the performance of state-of-the-art methods across a broad rang…

    Submitted 17 January, 2018; originally announced January 2018.

    Comments: 23 pages, 2 figures

    MSC Class: 62K05; 90C25

  50. arXiv:1712.09677  [pdf, other]

    math.OC cs.LG math.NA stat.ML

    Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods

    Authors: Nicolas Loizou, Peter Richtárik

    Abstract: In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of t…

    Submitted 28 March, 2018; v1 submitted 27 December, 2017; originally announced December 2017.

    Comments: 47 pages, 7 figures, 7 tables