Search | arXiv e-print repository

Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory

Authors: Nicole Mitchell, Johannes Ballé, Zachary Charles, Jakub Konečný

Abstract: A significant bottleneck in federated learning (FL) is the network communication cost of sending model updates from client devices to the central server. We present a comprehensive empirical study of the statistics of model updates in FL, as well as the role and benefits of various compression techniques. Motivated by these observations, we propose a novel method to reduce the average communication cost, which is near-optimal in many use cases, and outperforms Top-K, DRIVE, 3LC and QSGD on Stack Overflow next-word prediction, a realistic and challenging FL benchmark. This is achieved by examining the problem using rate-distortion theory, and proposing distortion as a reliable proxy for model accuracy. Distortion can be more effectively used for optimizing the trade-off between model performance and communication cost across clients. We demonstrate empirically that in spite of the non-i.i.d. nature of federated learning, the rate-distortion frontier is consistent across datasets, optimizers, clients and training rounds.

Submitted 19 May, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

arXiv:2107.06917 [pdf, other]

A Field Guide to Federated Optimization

Authors: Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H. Brendan McMahan, Blaise Aguera y Arcas, Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, Suhas Diggavi, Hubert Eichner, Advait Gadhikar, Zachary Garrett, Antonious M. Girgis, Filip Hanzely, Andrew Hard, Chaoyang He, Samuel Horvath, Zhouyuan Huo, Alex Ingerman, Martin Jaggi, Tara Javidi, Peter Kairouz, et al. (28 additional authors not shown)

Abstract: Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.

Submitted 14 July, 2021; originally announced July 2021.

arXiv:2103.05032 [pdf, other]

Convergence and Accuracy Trade-Offs in Federated Learning and Meta-Learning

Authors: Zachary Charles, Jakub Konečný

Abstract: We study a family of algorithms, which we refer to as local update methods, generalizing many federated and meta-learning algorithms. We prove that for quadratic models, local update methods are equivalent to first-order optimization on a surrogate loss we exactly characterize. Moreover, fundamental algorithmic choices (such as learning rates) explicitly govern a trade-off between the condition number of the surrogate loss and its alignment with the true loss. We derive novel convergence rates showcasing these trade-offs and highlight their importance in communication-limited settings. Using these insights, we are able to compare local update methods based on their convergence/accuracy trade-off, not just their convergence to critical points of the empirical loss. Our results shed new light on a broad range of phenomena, including the efficacy of server momentum in federated learning and the impact of proximal client updates.

Submitted 8 March, 2021; originally announced March 2021.

Journal ref: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021. PMLR: Volume 130

arXiv:2011.04928 [pdf, ps, other]

LinCbO: fast algorithm for computation of the Duquenne-Guigues basis

Authors: Radek Janostik, Jan Konecny, Petr Krajča

Abstract: We propose and evaluate a novel algorithm for computation of the Duquenne-Guigues basis which combines Close-by-One and LinClosure algorithms. This combination enables us to reuse attribute counters used in LinClosure and speed up the computation. Our experimental evaluation shows that it is the most efficient algorithm for computation of the Duquenne-Guigues basis.

Submitted 22 January, 2021; v1 submitted 10 November, 2020; originally announced November 2020.

ACM Class: F.2.2

arXiv:2010.06980 [pdf, other]

LCM from FCA Point of View: A CbO-style Algorithm with Speed-up Features

Authors: Radek Janostik, Jan Konecny, Petr Krajča

Abstract: LCM is an algorithm for enumeration of frequent closed itemsets in transaction databases. It is well known that when we ignore the required frequency, the closed itemsets are exactly intents of formal concepts in Formal Concept Analysis (FCA). We describe LCM in terms of FCA and show that LCM is basically the Close-by-One algorithm with multiple speed-up features for processing sparse data. We analyze the speed-up features and compare them with those of similar FCA algorithms, like FCbO and algorithms from the In-Close family.

Submitted 22 January, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

Comments: full version of a conference paper to be published in IJAR

ACM Class: F.2.2

arXiv:2007.00878 [pdf, other]

On the Outsized Importance of Learning Rates in Local Update Methods

Authors: Zachary Charles, Jakub Konečný

Abstract: We study a family of algorithms, which we refer to as local update methods, that generalize many federated learning and meta-learning algorithms. We prove that for quadratic objectives, local update methods perform stochastic gradient descent on a surrogate loss function which we exactly characterize. We show that the choice of client learning rate controls the condition number of that surrogate loss, as well as the distance between the minimizers of the surrogate and true loss functions. We use this theory to derive novel convergence rates for federated averaging that showcase this trade-off between the condition number of the surrogate loss and its alignment with the true loss function. We validate our results empirically, showing that in communication-limited settings, proper learning rate tuning is often sufficient to reach near-optimal behavior. We also present a practical method for automatic learning rate decay in local update methods that helps reduce the need for learning rate tuning, and highlight its empirical performance on a variety of tasks and datasets.

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2003.00295 [pdf, other]

Adaptive Federated Optimization

Authors: Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, H. Brendan McMahan

Abstract: Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FedAvg) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including Adagrad, Adam, and Yogi, and analyze their convergence in the presence of heterogeneous data for general non-convex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.

Submitted 8 September, 2021; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:1912.04977 [pdf, other]

Advances and Open Problems in Federated Learning

Authors: Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, et al. (34 additional authors not shown)

Abstract: Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

Submitted 8 March, 2021; v1 submitted 10 December, 2019; originally announced December 2019.

Comments: Published in Foundations and Trends in Machine Learning Vol 4 Issue 1. See: https://www.nowpublishers.com/article/Details/MAL-083

arXiv:1912.00131 [pdf, other]

Federated Learning with Autotuned Communication-Efficient Secure Aggregation

Authors: Keith Bonawitz, Fariborz Salehi, Jakub Konečný, Brendan McMahan, Marco Gruteser

Abstract: Federated Learning enables mobile devices to collaboratively learn a shared inference model while keeping all the training data on a user's device, decoupling the ability to do machine learning from the need to store the data in the cloud. Existing work on federated learning with limited communication demonstrates how random rotation can enable users' model updates to be quantized much more efficiently, reducing the communication cost between users and the server. Meanwhile, secure aggregation enables the server to learn an aggregate of at least a threshold number of device's model contributions without observing any individual device's contribution in unaggregated form. In this paper, we highlight some of the challenges of setting the parameters for secure aggregation to achieve communication efficiency, especially in the context of the aggressively quantized inputs enabled by random rotation. We then develop a recipe for auto-tuning communication-efficient secure aggregation, based on specific properties of random rotation and secure aggregation -- namely, the predictable distribution of vector entries post-rotation and the modular wrapping inherent in secure aggregation. We present both theoretical results and initial experiments.

Submitted 29 November, 2019; originally announced December 2019.

Comments: 5 pages, 3 figures. To appear at the IEEE Asilomar Conference on Signals, Systems, and Computers 2019

arXiv:1909.12488 [pdf, other]

Improving Federated Learning Personalization via Model Agnostic Meta Learning

Authors: Yihan Jiang, Jakub Konečný, Keith Rush, Sreeram Kannan

Abstract: Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how can the global model be personalized for every such device, individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.

Submitted 18 January, 2023; v1 submitted 27 September, 2019; originally announced September 2019.

arXiv:1904.03257 [pdf, ps, other]

MLSys: The New Frontier of Machine Learning Systems

Authors: Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung, Bill Dally, Jeff Dean, Inderjit S. Dhillon, Alexandros Dimakis, Pradeep Dubey, Charles Elkan, Grigori Fursin, Gregory R. Ganger, Lise Getoor, Phillip B. Gibbons, Garth A. Gibson, Joseph E. Gonzalez, Justin Gottschlich, Song Han, Kim Hazelwood, et al. (44 additional authors not shown)

Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.

Submitted 1 December, 2019; v1 submitted 29 March, 2019; originally announced April 2019.

arXiv:1902.01046 [pdf, other]

Towards Federated Learning at Scale: System Design

Authors: Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander

Abstract: Federated Learning is a distributed machine learning approach which enables model training on a large corpus of decentralized data. We have built a scalable production system for Federated Learning in the domain of mobile devices, based on TensorFlow. In this paper, we describe the resulting high-level design, sketch some of the challenges and their solutions, and touch upon the open problems and future directions.

Submitted 22 March, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

arXiv:1901.09367 [pdf, other]

A Privacy Preserving Randomized Gossip Algorithm via Controlled Noise Insertion

Authors: Filip Hanzely, Jakub Konečný, Nicolas Loizou, Peter Richtárik, Dmitry Grishchenko

Abstract: In this work we present a randomized gossip algorithm for solving the average consensus problem while at the same time protecting the information about the initial private values stored at the nodes. We give iteration complexity bounds for the method and perform extensive numerical experiments.

Submitted 27 January, 2019; originally announced January 2019.

Comments: NeurIPS 2018, Privacy Preserving Machine Learning Workshop (camera ready version). The full-length paper, which includes a number of additional algorithms and results (including proofs of statements and experiments), is available in arXiv:1706.07636

arXiv:1812.07210 [pdf, other]

Expanding the Reach of Federated Learning by Reducing Client Resource Requirements

Authors: Sebastian Caldas, Jakub Konečný, H. Brendan McMahan, Ameet Talwalkar

Abstract: Communication on heterogeneous edge networks is a fundamental bottleneck in Federated Learning (FL), restricting both model capacity and user participation. To address this issue, we introduce two novel strategies to reduce communication costs: (1) the use of lossy compression on the global model sent server-to-client; and (2) Federated Dropout, which allows users to efficiently train locally on smaller subsets of the global model and also provides a reduction in both client-to-server communication and local computation. We empirically show that these strategies, combined with existing compression approaches for client-to-server communication, collectively provide up to a $14\times$ reduction in server-to-client communication, a $1.7\times$ reduction in local computation, and a $28\times$ reduction in upload communication, all without degrading the quality of the final model. We thus comprehensively reduce FL's impact on client device resources, allowing higher capacity models to be trained, and a more diverse set of users to be reached.

Submitted 8 January, 2019; v1 submitted 18 December, 2018; originally announced December 2018.

arXiv:1812.01097 [pdf, other]

LEAF: A Benchmark for Federated Settings

Authors: Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, Ameet Talwalkar

Abstract: Modern federated networks, such as those comprised of wearable devices, mobile phones, or autonomous vehicles, generate massive amounts of data each day. This wealth of data can help to learn models that can improve the user experience on each device. However, the scale and heterogeneity of federated data presents new challenges in research areas such as federated learning, meta-learning, and multi-task learning. As the machine learning community begins to tackle these challenges, we are at a critical time to ensure that developments made in these areas are grounded with realistic benchmarks. To this end, we propose LEAF, a modular benchmarking framework for learning in federated settings. LEAF includes a suite of open-source federated datasets, a rigorous evaluation framework, and a set of reference implementations, all geared towards capturing the obstacles and intricacies of practical federated environments.

Submitted 9 December, 2019; v1 submitted 3 December, 2018; originally announced December 2018.

arXiv:1711.05509 [pdf, ps, other]

Note on Representing attribute reduction and concepts in concepts lattice using graphs

Authors: Jan Konecny

Abstract: Mao H. (2017, Representing attribute reduction and concepts in concept lattice using graphs. Soft Computing 21(24):7293--7311) claims to make contributions to the study of reduction of attributes in concept lattices by using graph theory. We show that her results are either trivial or already well-known and all three algorithms proposed in the paper are incorrect.

Submitted 30 May, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

Comments: 10 pages, 5 figures

arXiv:1707.01155 [pdf, other]

Stochastic, Distributed and Federated Optimization for Machine Learning

Authors: Jakub Konečný

Abstract: We study optimization algorithms for the finite sum problems frequently arising in machine learning applications. First, we propose novel variants of stochastic gradient descent with a variance reduction property that enables linear convergence for strongly convex objectives. Second, we study distributed setting, in which the data describing the optimization problem does not fit into a single computing node. In this case, traditional methods are inefficient, as the communication costs inherent in distributed optimization become the bottleneck. We propose a communication-efficient framework which iteratively forms local subproblems that can be solved with arbitrary local optimization algorithms. Finally, we introduce the concept of Federated Optimization/Learning, where we try to solve the machine learning problems without having data stored in any centralized manner. The main motivation comes from industry when handling user-generated data. The current prevalent practice is that companies collect vast amounts of user data and store them in datacenters. An alternative we propose is not to collect the data in first place, and instead occasionally use the computational power of users' devices to solve the very same optimization problems, while alleviating privacy concerns at the same time. In such setting, minimization of communication rounds is the primary goal, and we demonstrate that solving the optimization problems in such circumstances is conceptually tractable.

Submitted 4 July, 2017; originally announced July 2017.

Comments: PhD thesis

arXiv:1611.07555 [pdf, other]

Randomized Distributed Mean Estimation: Accuracy vs Communication

Authors: Jakub Konečný, Peter Richtárik

Abstract: We consider the problem of estimating the arithmetic average of a finite collection of real vectors stored in a distributed fashion across several compute nodes subject to a communication budget constraint. Our analysis does not rely on any statistical assumptions about the source of the vectors. This problem arises as a subproblem in many applications, including reduce-all operations within algorithms for distributed and federated optimization and learning. We propose a flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error. Our family contains the full-communication and zero-error method on one extreme, and an $\epsilon$-bit communication and $\mathcal{O}\left(1/(\epsilon n)\right)$ error method on the opposite extreme. In the special case where we communicate, in expectation, a single bit per coordinate of each vector, we improve upon existing results by obtaining $\mathcal{O}(r/n)$ error, where $r$ is the number of bits used to represent a floating point value.

Submitted 22 November, 2016; originally announced November 2016.

Comments: 19 pages, 1 figure

arXiv:1610.05492 [pdf, other]

Federated Learning: Strategies for Improving Communication Efficiency

Authors: Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon

Abstract: Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model while training data remains distributed over a large number of clients each with unreliable and relatively slow network connections. We consider learning algorithms for this setting where on each round, each client independently computes an update to the current model based on its local data, and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model. The typical clients in this setting are mobile phones, and communication efficiency is of the utmost importance. In this paper, we propose two ways to reduce the uplink communication costs: structured updates, where we directly learn an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, where we learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling before sending it to the server. Experiments on both convolutional and recurrent networks show that the proposed methods can reduce the communication cost by two orders of magnitude.

Submitted 30 October, 2017; v1 submitted 18 October, 2016; originally announced October 2016.

arXiv:1610.02527 [pdf, other]

Federated Optimization: Distributed Machine Learning for On-Device Intelligence

Authors: Jakub Konečný, H. Brendan McMahan, Daniel Ramage, Peter Richtárik

Abstract: We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number of nodes. The goal is to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of the utmost importance and minimizing the number of rounds of communication is the principal goal. A motivating example arises when we keep the training data locally on users' mobile devices instead of logging it to a data center for training. In federated optimization, the devices are used as compute nodes performing computation on their local data in order to update a global model. We suppose that we have extremely large number of devices in the network --- as many as the number of users of a given service, each of which has only a tiny fraction of the total data available. In particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, it is reasonable to assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results for sparse convex problems. This work also sets a path for future research needed in the context of federated optimization.

Submitted 8 October, 2016; originally announced October 2016.

Comments: 38 pages

arXiv:1608.06879 [pdf, other]

AIDE: Fast and Communication Efficient Distributed Optimization

Authors: Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczós, Alex Smola

Abstract: In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partitions arising in practice. It is well known that the DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the communication lower bounds but can also be implemented using a purely first-order oracle. Our empirical results show that AIDE is superior to other communication-efficient algorithms in settings that naturally arise in machine learning applications.

Submitted 24 August, 2016; originally announced August 2016.

arXiv:1512.04039 [pdf, other]

Distributed Optimization with Arbitrary Local Solvers

Authors: Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, Martin Takáč

Abstract: With the growth of data and the necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.

Submitted 3 August, 2016; v1 submitted 13 December, 2015; originally announced December 2015.

arXiv:1511.03575 [pdf, ps, other]

Federated Optimization: Distributed Optimization Beyond the Datacenter

Authors: Jakub Konečný, Brendan McMahan, Daniel Ramage

Abstract: We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large number of nodes, but the goal remains to train a high-quality centralized model. We refer to this setting as Federated Optimization. In this setting, communication efficiency is of utmost importance. A motivating example for federated optimization arises when we keep the training data locally on users' mobile devices rather than logging it to a data center for training. Instead, the mobile devices are used as nodes performing computation on their local data in order to update a global model. We suppose that we have an extremely large number of devices in our network, each of which has only a tiny fraction of the total data available; in particular, we expect the number of data points available locally to be much smaller than the number of devices. Additionally, since different users generate data with different patterns, we assume that no device has a representative sample of the overall distribution. We show that existing algorithms are not suitable for this setting, and propose a new algorithm which shows encouraging experimental results. This work also sets a path for future research needed in the context of federated optimization.

Submitted 11 November, 2015; originally announced November 2015.

Comments: NIPS workshop version

arXiv:1511.01942 [pdf, other]

Stop Wasting My Gradients: Practical SVRG

Authors: Reza Babanezhad, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, Scott Sallinen

Abstract: We present and analyze several strategies for improving the performance of stochastic variance-reduced gradient (SVRG) methods. We first show that the convergence rate of these methods can be preserved under a decreasing sequence of errors in the control variate, and use this to derive variants of SVRG that use growing-batch strategies to reduce the number of gradient calculations required in the early iterations. We further (i) show how to exploit support vectors to reduce the number of gradient computations in the later iterations, (ii) prove that the commonly-used regularized SVRG iteration is justified and improves the convergence rate, (iii) consider alternate mini-batch selection strategies, and (iv) consider the generalization error of the method.

Submitted 5 November, 2015; originally announced November 2015.

arXiv:1506.03930 [pdf, ps, other]

Complete relations on fuzzy complete lattices

Authors: Jan Konecny, Michal Krupka

Abstract: We generalize the notion of a complete binary relation on a complete lattice to residuated lattice valued ordered sets and show its properties. Then we focus on complete fuzzy tolerances on fuzzy complete lattices and prove they are in one-to-one correspondence with extensive isotone Galois connections. Finally, we prove that a fuzzy complete lattice, factorized by a complete fuzzy tolerance, is again a fuzzy complete lattice.

Submitted 12 June, 2015; originally announced June 2015.

Comments: Preprint submitted to Fuzzy Sets and Systems

arXiv:1504.04407 [pdf, other]

doi 10.1109/JSTSP.2015.2505682

Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

Authors: Jakub Konečný, Jie Liu, Peter Richtárik, Martin Takáč

Abstract: We propose mS2GD: a method incorporating a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent (S2GD). We consider the problem of minimizing a strongly convex function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex regularizer. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in the introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $b$ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that it benefits from two speedup effects. First, we prove that as long as $b$ is below a certain threshold, we can reach any predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Submitted 16 November, 2015; v1 submitted 16 April, 2015; originally announced April 2015.

arXiv:1410.4744 [pdf, other]

mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting

Authors: Jakub Konečný, Jie Liu, Peter Richtárik, Martin Takáč

Abstract: We propose a mini-batching scheme for improving the theoretical complexity and practical performance of semi-stochastic gradient descent applied to the problem of minimizing a strongly convex composite function represented as the sum of an average of a large number of smooth convex functions, and a simple nonsmooth convex function. Our method first performs a deterministic step (computation of the gradient of the objective function at the starting point), followed by a large number of stochastic steps. The process is repeated a few times with the last iterate becoming the new starting point. The novelty of our method is in the introduction of mini-batching into the computation of stochastic steps. In each step, instead of choosing a single function, we sample $b$ functions, compute their gradients, and compute the direction based on this. We analyze the complexity of the method and show that the method benefits from two speedup effects. First, we prove that as long as $b$ is below a certain threshold, we can reach a predefined accuracy with less overall work than without mini-batching. Second, our mini-batching scheme admits a simple parallel implementation, and hence is suitable for further acceleration by parallelization.

Submitted 17 October, 2014; originally announced October 2014.

arXiv:1410.0390 [pdf, ps, other]

Simple Complexity Analysis of Simplified Direct Search

Authors: Jakub Konečný, Peter Richtárik

Abstract: We consider the problem of unconstrained minimization of a smooth function in the derivative-free setting. In particular, we propose and study a simplified variant of the direct search method (of direction type), which we call simplified direct search (SDS). Unlike standard direct search methods, which depend on a large number of parameters that need to be tuned, SDS depends on a single scalar parameter only. Despite relevant research activity in direct search methods spanning several decades, complexity guarantees (bounds on the number of function evaluations needed to find an approximate solution) were not established until very recently. In this paper we give a surprisingly brief and unified analysis of SDS for nonconvex, convex and strongly convex functions. We match the existing complexity results for direct search in their dependence on the problem dimension ($n$) and error tolerance ($ε$), but the overall bounds are simpler, easier to interpret, and have better dependence on other problem parameters. In particular, we show that for the set of directions formed by the standard coordinate vectors and their negatives, the number of function evaluations needed to find an $ε$-solution is $O(n^2 /ε)$ (resp. $O(n^2 \log(1/ε))$) for the problem of minimizing a convex (resp. strongly convex) smooth function. In the nonconvex smooth case, the bound is $O(n^2/ε^2)$, with the goal being the reduction of the norm of the gradient below $ε$.

Submitted 13 November, 2014; v1 submitted 1 October, 2014; originally announced October 2014.

Comments: 21 pages, 5 algorithms, 1 table

arXiv:1312.4190 [pdf, other]

One-Shot-Learning Gesture Recognition using HOG-HOF Features

Authors: Jakub Konečný, Michal Hagara

Abstract: The purpose of this paper is to describe one-shot-learning gesture recognition systems developed on the ChaLearn Gesture Dataset. We use RGB and depth images and combine appearance (Histograms of Oriented Gradients) and motion descriptors (Histogram of Optical Flow) for parallel temporal segmentation and recognition. The Quadratic-Chi distance family is used to measure differences between histograms to capture cross-bin relationships. We also propose a new algorithm for trimming videos, to remove all the unimportant frames. We present two methods that use a combination of HOG-HOF descriptors together with variants of the Dynamic Time Warping technique. Both methods outperform other published methods and help narrow down the gap between human performance and algorithms on this task. The code has been made publicly available in the MLOSS repository.

Submitted 15 February, 2014; v1 submitted 15 December, 2013; originally announced December 2013.

Comments: 20 pages, 10 figures, 2 tables. To appear in Journal of Machine Learning Research subject to minor revision

arXiv:1312.1666 [pdf, other]

Semi-Stochastic Gradient Descent Methods

Authors: Jakub Konečný, Peter Richtárik

Abstract: In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs in each of which a single full gradient and a random number of stochastic gradients are computed, following a geometric law. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is $O((κ/n)\log(1/\varepsilon))$, where $κ$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single gradient evaluation and $O(κ)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((κ/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(κ/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for a problem with $n=10^9$ and $κ=10^3$.

Submitted 16 June, 2015; v1 submitted 5 December, 2013; originally announced December 2013.

Comments: 19 pages, 3 figures, 2 algorithms, 3 tables

Showing 1–30 of 30 results for author: Konečný, J