-
Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD
Authors:
Konstantinos E. Nikolakakis,
Farzin Haddadpour,
Amin Karbasi,
Dionysios S. Kalogerias
Abstract:
We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error in terms of the accumulated path-dependent optimization error, terminal optimization error, number of samples, and number of iterations. For nonconvex smooth losses, we prove that full-batch GD efficiently generalizes close to any stationary point at termination, and recovers the generalization error guarantees of stochastic algorithms with fewer assumptions. For smooth convex losses, we show that the generalization error is tighter than existing bounds for SGD (up to one order of error magnitude). Consequently, the excess risk matches that of SGD for quadratically fewer iterations. Lastly, for strongly convex smooth losses, we show that full-batch GD achieves essentially the same excess risk rate as compared with the state of the art on SGD, but with an exponentially smaller number of iterations (logarithmic in the dataset size).
Submitted 9 February, 2023; v1 submitted 26 April, 2022;
originally announced April 2022.
-
Learning Distributionally Robust Models at Scale via Composite Optimization
Authors:
Farzin Haddadpour,
Mohammad Mahdi Kamani,
Mehrdad Mahdavi,
Amin Karbasi
Abstract:
To train machine learning models that are robust to distribution shifts in the data, distributionally robust optimization (DRO) has been proven very effective. However, the existing approaches to learning a distributionally robust model either require solving complex optimization problems such as semidefinite programming or a first-order method whose convergence scales linearly with the number of data samples -- which hinders their scalability to large datasets. In this paper, we show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods. We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.
Submitted 17 March, 2022;
originally announced March 2022.
-
Black-Box Generalization: Stability of Zeroth-Order Learning
Authors:
Konstantinos E. Nikolakakis,
Farzin Haddadpour,
Dionysios S. Kalogerias,
Amin Karbasi
Abstract:
We provide the first generalization error analysis for black-box learning through derivative-free optimization. Under the assumption of a Lipschitz and smooth unknown loss, we consider the Zeroth-order Stochastic Search (ZoSS) algorithm, that updates a $d$-dimensional model by replacing stochastic gradient directions with stochastic differences of $K+1$ perturbed loss evaluations per dataset (example) query. For both unbounded and bounded possibly nonconvex losses, we present the first generalization bounds for the ZoSS algorithm. These bounds coincide with those for SGD, and rather surprisingly are independent of $d$, $K$ and the batch size $m$, under appropriate choices of a slightly decreased learning rate. For bounded nonconvex losses and a batch size $m=1$, we additionally show that both generalization error and learning rate are independent of $d$ and $K$, and remain essentially the same as for the SGD, even for two function evaluations. Our results extensively extend and consistently recover established results for SGD in prior work, on both generalization bounds and corresponding learning rates. If additionally $m=n$, where $n$ is the dataset size, we derive generalization guarantees for full-batch GD as well.
Submitted 9 February, 2023; v1 submitted 14 February, 2022;
originally announced February 2022.
-
FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching
Authors:
Farzin Haddadpour,
Belhal Karimi,
** Li,
Xiaoyun Li
Abstract:
Communication complexity and privacy are the two key challenges in Federated Learning, where the goal is to perform distributed learning across a large number of devices. In this work, we introduce the FedSKETCH and FedSKETCHGATE algorithms to address both challenges jointly; they are intended for homogeneous and heterogeneous data distribution settings, respectively. The key idea is to compress the accumulation of local gradients using count sketch, so the server does not have access to the gradients themselves, which provides privacy. Furthermore, due to the lower dimension of the sketches used, our method is communication-efficient as well. We provide sharp convergence guarantees for the aforementioned schemes. Finally, we back up our theory with a varied set of experiments.
Submitted 11 August, 2020;
originally announced August 2020.
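The count-sketch compression mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic toy count sketch, not the FedSKETCH implementation: the class name `CountSketch`, the table sizes, and the two-client setup are assumptions made for illustration, and nothing here reproduces the paper's privacy or convergence analysis. It only shows that sketches are linear (so a server can add them without seeing the raw gradients) and that a large coordinate can be estimated from the summed table.

```python
import numpy as np


class CountSketch:
    """Toy count sketch for length-d vectors (illustrative only)."""

    def __init__(self, d, rows=5, cols=100, seed=0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        # One random bucket and one random sign per (row, coordinate).
        # All clients are assumed to share these hash tables.
        self.bucket = rng.integers(0, cols, size=(rows, d))
        self.sign = rng.choice(np.array([-1.0, 1.0]), size=(rows, d))

    def compress(self, g):
        """Map a length-d vector to a rows x cols table."""
        table = np.zeros((self.rows, self.cols))
        for j in range(self.rows):
            # Unbuffered add, so coordinates sharing a bucket accumulate.
            np.add.at(table[j], self.bucket[j], self.sign[j] * g)
        return table

    def estimate(self, table):
        """Estimate every coordinate as the median over rows."""
        row_idx = np.arange(self.rows)[:, None]
        per_row = self.sign * table[row_idx, self.bucket]
        return np.median(per_row, axis=0)


d = 1000
sketch = CountSketch(d)
rng = np.random.default_rng(1)

# Two hypothetical clients with small gradients and one shared heavy coordinate.
g1 = 0.01 * rng.standard_normal(d)
g2 = 0.01 * rng.standard_normal(d)
g1[7] = 60.0
g2[7] = 40.0

# The server only ever adds the compressed tables.
server_table = sketch.compress(g1) + sketch.compress(g2)
recovered = sketch.estimate(server_table)
```

Each client sends a 5 x 100 table instead of a 1000-dimensional vector, which is where the communication saving in this toy setup comes from.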
-
Federated Learning with Compression: Unified Analysis and Sharp Guarantees
Authors:
Farzin Haddadpour,
Mohammad Mahdi Kamani,
Aryan Mokhtari,
Mehrdad Mahdavi
Abstract:
In federated learning, communication cost is often a critical bottleneck to scale up distributed optimization algorithms to collaboratively learn a model from millions of devices with potentially unreliable or limited communication and heterogeneous data distributions. Two notable trends to deal with the communication overhead of federated algorithms are gradient compression and local computation with periodic communication. Despite many attempts, characterizing the relationship between these two approaches has proven elusive. We address this by proposing a set of algorithms with periodical compressed (quantized or sparsified) communication and analyze their convergence properties in both homogeneous and heterogeneous local data distribution settings. For the homogeneous setting, our analysis improves existing bounds by providing tighter convergence rates for both strongly convex and non-convex objective functions. To mitigate data heterogeneity, we introduce a local gradient tracking scheme and obtain sharp convergence rates that match the best-known communication complexities without compression for convex, strongly convex, and nonconvex settings. We complement our theoretical results and demonstrate the effectiveness of our proposed methods by several experiments on real-world datasets.
Submitted 20 November, 2020; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Efficient Fair Principal Component Analysis
Authors:
Mohammad Mahdi Kamani,
Farzin Haddadpour,
Rana Forsati,
Mehrdad Mahdavi
Abstract:
It has been shown that dimension reduction methods such as PCA may be inherently prone to unfairness and treat data from different sensitive groups such as race, color, sex, etc., unfairly. In pursuit of fairness-enhancing dimensionality reduction, using the notion of Pareto optimality, we propose an adaptive first-order algorithm to learn a subspace that preserves fairness, while slightly compromising the reconstruction loss. Theoretically, we provide sufficient conditions that the solution of the proposed algorithm belongs to the Pareto frontier for all sensitive groups; thereby, the optimal trade-off between overall reconstruction loss and fairness constraints is guaranteed. We also provide the convergence analysis of our algorithm and show its efficacy through empirical studies on different datasets, which demonstrates superior performance in comparison with state-of-the-art algorithms. The proposed fairness-aware PCA algorithm can be efficiently generalized to multiple group sensitive features and effectively reduce the unfairness decisions in downstream tasks such as classification.
Submitted 7 March, 2020; v1 submitted 12 November, 2019;
originally announced November 2019.
-
On the Convergence of Local Descent Methods in Federated Learning
Authors:
Farzin Haddadpour,
Mehrdad Mahdavi
Abstract:
In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a., heterogeneous or non-i.i.d. data samples). In this paper, we generalize local stochastic and full gradient descent with periodic averaging, originally designed for homogeneous distributed optimization, to solve nonconvex optimization problems in federated learning. Although scant research is available on the effectiveness of local SGD in reducing the number of communication rounds in the homogeneous setting, its convergence and communication complexity in the heterogeneous setting is mostly demonstrated empirically and lacks thorough theoretical understanding. To bridge this gap, we demonstrate that by properly analyzing the effect of unbiased gradients and sampling schema in the federated setting, under mild assumptions, the implicit variance reduction feature of local distributed methods generalizes to heterogeneous data shards and exhibits the best known convergence rates of the homogeneous setting, both in general nonconvex problems and under the Polyak-Łojasiewicz (PL) condition (a generalization of strong convexity). Our theoretical results complement the recent empirical studies that demonstrate the applicability of local GD/SGD to federated learning. We also specialize the proposed local method for networked distributed optimization. To the best of our knowledge, the obtained convergence rates are the sharpest known to date on the convergence of local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.
Submitted 6 December, 2019; v1 submitted 31 October, 2019;
originally announced October 2019.
-
Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization
Authors:
Farzin Haddadpour,
Mohammad Mahdi Kamani,
Mehrdad Mahdavi,
Viveck R. Cadambe
Abstract:
Communication overhead is one of the key challenges that hinder the scalability of distributed optimization algorithms. In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and the computation nodes perform local updates while periodically exchanging the model among the workers to perform averaging. While local SGD is empirically shown to provide promising results, a theoretical understanding of its performance remains open. We strengthen convergence analysis for local SGD, and show that local SGD can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions that satisfy the Polyak-Łojasiewicz condition, $O((pT)^{1/3})$ rounds of communication suffice to achieve a linear speed up, that is, an error of $O(1/pT)$, where $T$ is the total number of model updates at each worker. This is in contrast with previous work, which required a higher number of communication rounds and was limited to strongly convex loss functions, for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speed up. Finally, we validate the theory with experimental results, running over AWS EC2 clouds and an internal GPU cluster.
Submitted 14 May, 2020; v1 submitted 29 October, 2019;
originally announced October 2019.
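The local SGD scheme described in the abstract above (local updates, then periodic model averaging) can be sketched in a few lines. This is a single-process simulation under assumptions that are not from the paper: a noiseless least-squares loss, a fixed synchronization period `tau`, and the hypothetical function name `local_sgd`. It does not implement the adaptive synchronization scheme or reproduce the stated communication bounds.

```python
import numpy as np


def local_sgd(X_shards, y_shards, lr=0.05, rounds=200, tau=5, seed=0):
    """Simulate local SGD with periodic averaging on least squares.

    Each of the p workers starts a round from the shared model, takes
    `tau` single-sample SGD steps on its own shard, and the p local
    models are then averaged (one communication round).
    """
    rng = np.random.default_rng(seed)
    d = X_shards[0].shape[1]
    w_global = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for X, y in zip(X_shards, y_shards):
            w = w_global.copy()
            for _ in range(tau):
                i = rng.integers(len(y))          # sample one local example
                grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
                w -= lr * grad
            local_models.append(w)
        # Periodic averaging: the only step that needs communication.
        w_global = np.mean(local_models, axis=0)
    return w_global


# Hypothetical data: 4 workers, 50 examples each, shared ground-truth model.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_shards = [rng.standard_normal((50, 3)) for _ in range(4)]
y_shards = [X @ w_true for X in X_shards]

w_hat = local_sgd(X_shards, y_shards)
```

With `rounds=200` and `tau=5`, each worker performs 1000 model updates but communicates only 200 times; raising `tau` trades communication for more local drift between averaging steps.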
-
Straggler-Resilient and Communication-Efficient Distributed Iterative Linear Solver
Authors:
Farzin Haddadpour,
Yaoqing Yang,
Malhar Chaudhari,
Viveck R Cadambe,
Pulkit Grover
Abstract:
We propose a novel distributed iterative linear inverse solver method. Our method, PolyLin, has significantly lower communication cost, both in terms of number of rounds as well as number of bits, in comparison with the state of the art at the cost of higher computational complexity and storage. Our algorithm also has a built-in resilience to straggling and faulty computation nodes. We develop a natural variant of our main algorithm that trades off communication cost for computational complexity. Our method is inspired by ideas in error correcting codes.
Submitted 15 June, 2018;
originally announced June 2018.
-
On the Optimal Recovery Threshold of Coded Matrix Multiplication
Authors:
Sanghamitra Dutta,
Mohammad Fahim,
Farzin Haddadpour,
Haewon Jeong,
Viveck Cadambe,
Pulkit Grover
Abstract:
We provide novel coded computation strategies for distributed matrix-matrix products that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers. When a $1/m$ fraction of each matrix can be stored in each worker node, Polynomial codes require $m^2$ successful workers, while our MatDot codes only require $2m-1$ successful workers, albeit at a higher communication cost from each worker to the fusion node. We also provide a systematic construction of MatDot codes. Further, we propose "PolyDot" coding that interpolates between Polynomial codes and MatDot codes to trade off communication cost and recovery threshold. Finally, we demonstrate a coding technique for multiplying $n$ matrices ($n \geq 3$) by applying MatDot and PolyDot coding ideas.
Submitted 16 May, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
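The $2m-1$ recovery threshold quoted in the abstract above can be checked numerically with a small sketch of a MatDot-style code. The function names `matdot_encode` and `matdot_decode`, the evaluation points, and the matrix sizes are assumptions chosen for illustration; the decoder uses plain Vandermonde interpolation over the reals rather than anything tuned for numerical stability. The idea: split $A$ into $m$ column blocks and $B$ into $m$ row blocks, encode them as polynomials whose product has $AB$ as the coefficient of $x^{m-1}$, and interpolate that degree-$(2m-2)$ product from any $2m-1$ worker results.

```python
import numpy as np


def matdot_encode(A, B, m, points):
    """Return one (pA(x), pB(x)) pair of encoded blocks per worker."""
    A_blocks = np.split(A, m, axis=1)   # A = [A_0 | ... | A_{m-1}]
    B_blocks = np.split(B, m, axis=0)   # B = [B_0; ...; B_{m-1}]
    tasks = []
    for x in points:
        pA = sum(A_blocks[i] * x**i for i in range(m))
        pB = sum(B_blocks[j] * x**(m - 1 - j) for j in range(m))
        tasks.append((pA, pB))
    return tasks


def matdot_decode(results, points, m):
    """Recover A @ B from exactly 2m-1 worker products pA(x) @ pB(x)."""
    k = 2 * m - 1
    if len(results) != k or len(points) != k:
        raise ValueError("need exactly 2m-1 results and points")
    V = np.vander(np.asarray(points, dtype=float), k, increasing=True)
    stacked = np.stack(results).reshape(k, -1)
    coeffs = np.linalg.solve(V, stacked)      # polynomial coefficients
    # The coefficient of x^(m-1) equals sum_i A_i @ B_i = A @ B.
    return coeffs[m - 1].reshape(results[0].shape)


rng = np.random.default_rng(0)
m = 3
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))

points = np.linspace(-1.0, 1.0, 7)            # 7 workers, distinct points
tasks = matdot_encode(A, B, m, points)
worker_outputs = [pA @ pB for pA, pB in tasks]

alive = [0, 2, 3, 5, 6]                       # any 2m-1 = 5 workers suffice
C = matdot_decode([worker_outputs[i] for i in alive], points[alive], m)
```

Each worker stores only a $1/m$ fraction of $A$ and of $B$ but returns a full-size product, which is the higher worker-to-fusion-node communication cost the abstract mentions.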
-
Low-Complexity Stochastic Generalized Belief Propagation
Authors:
Farzin Haddadpour,
Mahdi Jafari Siavoshani,
Morteza Noshad
Abstract:
The generalized belief propagation (GBP), introduced by Yedidia et al., is an extension of the belief propagation (BP) algorithm, which is widely used in different problems involved in calculating exact or approximate marginals of probability distributions. In many problems, it has been observed that the accuracy of GBP considerably outperforms that of BP. However, because in general the computational complexity of GBP is higher than BP, its application is limited in practice.
In this paper, we introduce a stochastic version of GBP called stochastic generalized belief propagation (SGBP) that can be considered as an extension to the stochastic BP (SBP) algorithm introduced by Noorshams et al. They have shown that SBP reduces the complexity per iteration of BP by an order of magnitude in alphabet size. In contrast to SBP, SGBP can reduce the computation complexity if certain topological conditions are met by the region graph associated with a graphical model. However, this reduction can be larger than only one order of magnitude in alphabet size. In this paper, we characterize these conditions and the amount of computation gain that we can obtain by using SGBP. Finally, using proof techniques similar to those employed by Noorshams et al., for general graphical models that satisfy contraction conditions, we prove the asymptotic convergence of SGBP to the unique GBP fixed point, and provide non-asymptotic upper bounds on the mean square error and on the high probability error.
Submitted 6 May, 2016;
originally announced May 2016.
-
Simulation of a Channel with Another Channel
Authors:
Farzin Haddadpour,
Mohammad Hossein Yassaee,
Salman Beigi,
Amin Gohari,
Mohammad Reza Aref
Abstract:
In this paper, we study the problem of simulating a DMC channel from another DMC channel under an average-case and an exact model. We present several achievability and infeasibility results, with tight characterizations in special cases. In particular for the exact model, we fully characterize when a BSC channel can be simulated from a BEC channel when there is no shared randomness. We also provide infeasibility and achievability results for simulation of a binary channel from another binary channel in the case of no shared randomness. To do this, we use properties of Rényi capacity of a given order. We also introduce a notion of "channel diameter" which is shown to be additive and satisfy a data processing inequality.
Submitted 1 December, 2016; v1 submitted 25 May, 2013;
originally announced May 2013.
-
On AVCs with Quadratic Constraints
Authors:
Farzin Haddadpour,
Mahdi Jafari Siavoshani,
Mayank Bakshi,
Sidharth Jaggi
Abstract:
In this work we study an Arbitrarily Varying Channel (AVC) with quadratic power constraints on the transmitter and a so-called "oblivious" jammer (along with additional AWGN) under a maximum probability of error criterion, and no private randomness between the transmitter and the receiver. This is in contrast to similar AVC models under the average probability of error criterion considered in [1], and models wherein common randomness is allowed [2] -- these distinctions are important in some communication scenarios outlined below.
We consider the regime where the jammer's power constraint is smaller than the transmitter's power constraint (in the other regime it is known no positive rate is possible). For this regime we show the existence of stochastic codes (with no common randomness between the transmitter and receiver) that enables reliable communication at the same rate as when the jammer is replaced with AWGN with the same power constraint. This matches known information-theoretic outer bounds. In addition to being a stronger result than that in [1] (enabling recovery of the results therein), our proof techniques are also somewhat more direct, and hence may be of independent interest.
Submitted 27 January, 2013;
originally announced January 2013.
-
Coordination via a relay
Authors:
Farzin Haddadpour,
Mohammad Hossein Yassaee,
Amin Gohari,
Mohammad Reza Aref
Abstract:
In this paper, we study the problem of coordinating two nodes which can only exchange information via a relay at limited rates. The nodes are allowed to do a two-round interactive two-way communication with the relay, after which they should be able to generate i.i.d. copies of two random variables with a given joint distribution within a vanishing total variation distance. We prove inner and outer bounds on the coordination capacity region for this problem. Our inner bound is proved using the technique of "output statistics of random binning" that has recently been developed by Yassaee, et al.
Submitted 4 March, 2012;
originally announced March 2012.