Search | arXiv e-print repository

Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD

Authors: Konstantinos E. Nikolakakis, Farzin Haddadpour, Amin Karbasi, Dionysios S. Kalogerias

Abstract: We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently, the excess risk matches that of SGD for quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).

Submitted 9 February, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

Comments: 35 pages

arXiv:2203.09607 [pdf, other]

Learning Distributionally Robust Models at Scale via Composite Optimization

Authors: Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, Amin Karbasi

Abstract: To train machine learning models that are robust to distribution shifts in the data, distributionally robust optimization (DRO) has been proven very effective. However, the existing approaches to learning a distributionally robust model either require solving complex optimization problems such as semidefinite programming or a first-order method whose convergence scales linearly with the number of data samples -- which hinders their scalability to large datasets. In this paper, we show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods. We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.

Submitted 17 March, 2022; originally announced March 2022.

Comments: Accepted to ICLR2022 as a conference paper. International Conference on Learning Representations (2022)

arXiv:2202.06880 [pdf, ps, other]

Black-Box Generalization: Stability of Zeroth-Order Learning

Authors: Konstantinos E. Nikolakakis, Farzin Haddadpour, Dionysios S. Kalogerias, Amin Karbasi

Abstract: We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.

Submitted 9 February, 2023; v1 submitted 14 February, 2022; originally announced February 2022.

Comments: 32 pages

arXiv:2008.04975 [pdf, ps, other]

FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching

Authors: Farzin Haddadpour, Belhal Karimi, ** Li, Xiaoyun Li

Abstract: Communication complexity and privacy are the two key challenges in Federated Learning, where the goal is to perform distributed learning across a large number of devices. In this work, we introduce the FedSKETCH and FedSKETCHGATE algorithms to address both challenges in Federated Learning jointly, where these algorithms are intended to be used for homogeneous and heterogeneous data distribution settings respectively. The key idea is to compress the accumulation of local gradients using count sketch; therefore, the server does not have access to the gradients themselves, which provides privacy. Furthermore, due to the lower dimension of the sketching used, our method is communication-efficient as well. We provide, for the aforementioned schemes, sharp convergence guarantees. Finally, we back up our theory with a varied set of experiments.

Submitted 11 August, 2020; originally announced August 2020.

arXiv:2007.01154 [pdf, other]

Federated Learning with Compression: Unified Analysis and Sharp Guarantees

Authors: Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, Mehrdad Mahdavi

Abstract: In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends to deal with the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodic compressed (quantized or sparsified) communication and analyzing their convergence properties in both homogeneous and heterogeneous local data distribution settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and nonconvex settings. We complement our theoretical results and demonstrate the effectiveness of our proposed methods by several experiments on real-world datasets.

Submitted 20 November, 2020; v1 submitted 2 July, 2020; originally announced July 2020.

Comments: version 2. more experiments and comparisons

arXiv:1911.04931 [pdf, other]

Efficient Fair Principal Component Analysis

Authors: Mohammad Mahdi Kamani, Farzin Haddadpour, Rana Forsati, Mehrdad Mahdavi

Abstract: It has been shown that dimension reduction methods such as PCA may be inherently prone to unfairness and treat data from different sensitive groups such as race, color, sex, etc., unfairly. In pursuit of fairness-enhancing dimensionality reduction, using the notion of Pareto optimality, we propose an adaptive first-order algorithm to learn a subspace that preserves fairness, while slightly compromising the reconstruction loss. Theoretically, we provide sufficient conditions under which the solution of the proposed algorithm belongs to the Pareto frontier for all sensitive groups; thereby, the optimal trade-off between overall reconstruction loss and fairness constraints is guaranteed. We also provide the convergence analysis of our algorithm and show its efficacy through empirical studies on different datasets, which demonstrate superior performance in comparison with state-of-the-art algorithms. The proposed fairness-aware PCA algorithm can be efficiently generalized to multiple group sensitive features and effectively reduce unfair decisions in downstream tasks such as classification.

Submitted 7 March, 2020; v1 submitted 12 November, 2019; originally announced November 2019.

arXiv:1910.14425 [pdf, other]

On the Convergence of Local Descent Methods in Federated Learning

Authors: Farzin Haddadpour, Mehrdad Mahdavi

Abstract: In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a., heterogeneous or non i.i.d. data samples). In this paper, we generalize the local stochastic and full gradient descent with periodic averaging, originally designed for homogeneous distributed optimization, to solve nonconvex optimization problems in federated learning. Although scant research is available on the effectiveness of local SGD in reducing the number of communication rounds in the homogeneous setting, its convergence and communication complexity in the heterogeneous setting is mostly demonstrated empirically and lacks thorough theoretical understanding. To bridge this gap, we demonstrate that by properly analyzing the effect of unbiased gradients and sampling schema in the federated setting, under mild assumptions, the implicit variance reduction feature of local distributed methods generalizes to heterogeneous data shards and exhibits the best known convergence rates of the homogeneous setting both in general nonconvex and under the Polyak-Łojasiewicz (PL) condition (a generalization of strong convexity). Our theoretical results complement the recent empirical studies that demonstrate the applicability of local GD/SGD to federated learning. We also specialize the proposed local method for networked distributed optimization. To the best of our knowledge, the obtained convergence rates are the sharpest known to date on the convergence of local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.

Submitted 6 December, 2019; v1 submitted 31 October, 2019; originally announced October 2019.

Comments: 47 pages, "Updates from v1: A technical error in Lemma B3 is corrected"

arXiv:1910.13598 [pdf, other]

Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization

Authors: Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, Viveck R. Cadambe

Abstract: Communication overhead is one of the key challenges that hinder the scalability of distributed optimization algorithms. In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates while periodically exchanging the model among the workers to perform averaging. While local SGD is empirically shown to provide promising results, a theoretical understanding of its performance remains open. We strengthen convergence analysis for local SGD, and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Łojasiewicz condition, $O((pT)^{1/3})$ rounds of communication suffice to achieve a linear speed up, that is, an error of $O(1/pT)$, where $T$ is the total number of model updates at each worker. This is in contrast with previous work, which required a higher number of communication rounds and was limited to strongly convex loss functions, for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed up. Finally, we validate the theory with experimental results, running over AWS EC2 clouds and an internal GPU cluster.

Submitted 14 May, 2020; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: Paper accepted to NeurIPS 2019 - We fixed a flaw in the earlier version regarding the dependency on constants but this change does not affect the communication complexity

arXiv:1806.06140 [pdf, other]

Straggler-Resilient and Communication-Efficient Distributed Iterative Linear Solver

Authors: Farzin Haddadpour, Yaoqing Yang, Malhar Chaudhari, Viveck R Cadambe, Pulkit Grover

Abstract: We propose a novel distributed iterative linear inverse solver method. Our method, PolyLin, has significantly lower communication cost, both in terms of number of rounds as well as number of bits, in comparison with the state of the art at the cost of higher computational complexity and storage. Our algorithm also has a built-in resilience to straggling and faulty computation nodes. We develop a natural variant of our main algorithm that trades off communication cost for computational complexity. Our method is inspired by ideas in error correcting codes.

Submitted 15 June, 2018; originally announced June 2018.

Comments: 15 pages, 3 figures and 2 tables

arXiv:1801.10292 [pdf, other]

On the Optimal Recovery Threshold of Coded Matrix Multiplication

Authors: Sanghamitra Dutta, Mohammad Fahim, Farzin Haddadpour, Haewon Jeong, Viveck Cadambe, Pulkit Grover

Abstract: We provide novel coded computation strategies for distributed matrix-matrix products that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers. When a $1/m$ fraction of each matrix can be stored in each worker node, Polynomial codes require $m^2$ successful workers, while our MatDot codes only require $2m-1$ successful workers, albeit at a higher communication cost from each worker to the fusion node. We also provide a systematic construction of MatDot codes. Further, we propose "PolyDot" coding that interpolates between Polynomial codes and MatDot codes to trade off communication cost and recovery threshold. Finally, we demonstrate a coding technique for multiplying $n$ matrices ($n \geq 3$) by applying MatDot and PolyDot coding ideas.

Submitted 16 May, 2018; v1 submitted 30 January, 2018; originally announced January 2018.

Comments: Extended version of the paper that appeared at Allerton 2017 (October 2017), including full proofs and further results. Submitted to IEEE Transactions on Information Theory

arXiv:1605.02046 [pdf, other]

Low-Complexity Stochastic Generalized Belief Propagation

Authors: Farzin Haddadpour, Mahdi Jafari Siavoshani, Morteza Noshad

Abstract: The generalized belief propagation (GBP), introduced by Yedidia et al., is an extension of the belief propagation (BP) algorithm, which is widely used in different problems involved in calculating exact or approximate marginals of probability distributions. In many problems, it has been observed that the accuracy of GBP considerably outperforms that of BP. However, because in general the computational complexity of GBP is higher than BP, its application is limited in practice. In this paper, we introduce a stochastic version of GBP called stochastic generalized belief propagation (SGBP) that can be considered as an extension to the stochastic BP (SBP) algorithm introduced by Noorshams et al. They have shown that SBP reduces the complexity per iteration of BP by an order of magnitude in alphabet size. In contrast to SBP, SGBP can reduce the computation complexity if certain topological conditions are met by the region graph associated with a graphical model. However, this reduction can be larger than only one order of magnitude in alphabet size. In this paper, we characterize these conditions and the amount of computation gain that we can obtain by using SGBP. Finally, using similar proof techniques employed by Noorshams et al., for general graphical models that satisfy contraction conditions, we prove the asymptotic convergence of SGBP to the unique GBP fixed point, as well as providing non-asymptotic upper bounds on the mean square error and on the high probability error.

Submitted 6 May, 2016; originally announced May 2016.

Comments: 18 pages, 11 figures, a shorter version of this paper was accepted in ISIT'16

arXiv:1305.5901 [pdf, other]

Simulation of a Channel with Another Channel

Authors: Farzin Haddadpour, Mohammad Hossein Yassaee, Salman Beigi, Amin Gohari, Mohammad Reza Aref

Abstract: In this paper, we study the problem of simulating a DMC channel from another DMC channel under an average-case and an exact model. We present several achievability and infeasibility results, with tight characterizations in special cases. In particular for the exact model, we fully characterize when a BSC channel can be simulated from a BEC channel when there is no shared randomness. We also provide infeasibility and achievability results for simulation of a binary channel from another binary channel in the case of no shared randomness. To do this, we use properties of Rényi capacity of a given order. We also introduce a notion of "channel diameter" which is shown to be additive and satisfy a data processing inequality.

Submitted 1 December, 2016; v1 submitted 25 May, 2013; originally announced May 2013.

Comments: 31 pages, 10 figures, and some parts of this work were published at ITW 2013

arXiv:1301.6345 [pdf, other]

On AVCs with Quadratic Constraints

Authors: Farzin Haddadpour, Mahdi Jafari Siavoshani, Mayank Bakshi, Sidharth Jaggi

Abstract: In this work we study an Arbitrarily Varying Channel (AVC) with quadratic power constraints on the transmitter and a so-called "oblivious" jammer (along with additional AWGN) under a maximum probability of error criterion, and no private randomness between the transmitter and the receiver. This is in contrast to similar AVC models under the average probability of error criterion considered in [1], and models wherein common randomness is allowed [2] -- these distinctions are important in some communication scenarios outlined below. We consider the regime where the jammer's power constraint is smaller than the transmitter's power constraint (in the other regime it is known no positive rate is possible). For this regime we show the existence of stochastic codes (with no common randomness between the transmitter and receiver) that enable reliable communication at the same rate as when the jammer is replaced with AWGN with the same power constraint. This matches known information-theoretic outer bounds. In addition to being a stronger result than that in [1] (enabling recovery of the results therein), our proof techniques are also somewhat more direct, and hence may be of independent interest.

Submitted 27 January, 2013; originally announced January 2013.

Comments: A shorter version of this work will be sent to ISIT13, Istanbul. 8 pages, 3 figures

arXiv:1203.0731 [pdf, ps, other]

Coordination via a relay

Authors: Farzin Haddadpour, Mohammad Hossein Yassaee, Amin Gohari, Mohammad Reza Aref

Abstract: In this paper, we study the problem of coordinating two nodes which can only exchange information via a relay at limited rates. The nodes are allowed to do a two-round interactive two-way communication with the relay, after which they should be able to generate i.i.d. copies of two random variables with a given joint distribution within a vanishing total variation distance. We prove inner and outer bounds on the coordination capacity region for this problem. Our inner bound is proved using the technique of "output statistics of random binning" that has recently been developed by Yassaee, et al.

Submitted 4 March, 2012; originally announced March 2012.

Comments: Submitted to ISIT 2012

Showing 1–14 of 14 results for author: Haddadpour, F