Search | arXiv e-print repository

Accelerated Stochastic Min-Max Optimization Based on Bias-corrected Momentum

Authors: Haoyuan Cai, Sulaiman A. Alghunaim, Ali H. Sayed

Abstract: Lower-bound analyses for nonconvex strongly-concave minimax optimization problems have shown that stochastic first-order algorithms require at least $\mathcal{O}(\varepsilon^{-4})$ oracle complexity to find an $\varepsilon$-stationary point. Some works indicate that this complexity can be improved to $\mathcal{O}(\varepsilon^{-3})$ when the loss gradient is Lipschitz continuous. The question of achieving enhanced convergence rates under distinct conditions remains unresolved. In this work, we address this question for optimization problems that are nonconvex in the minimization variable and strongly concave or Polyak-Lojasiewicz (PL) in the maximization variable. We introduce novel bias-corrected momentum algorithms utilizing efficient Hessian-vector products. We establish convergence conditions and demonstrate a lower iteration complexity of $\mathcal{O}(\varepsilon^{-3})$ for the proposed algorithms. The effectiveness of the method is validated through applications to robust logistic regression using real-world datasets.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2401.14585

Diffusion Stochastic Optimization for Min-Max Problems

Authors: Haoyuan Cai, Sulaiman A. Alghunaim, Ali H. Sayed

Abstract: The optimistic gradient method is useful in addressing minimax optimization problems. Motivated by the observation that the conventional stochastic version suffers from the need for a large batch size on the order of $\mathcal{O}(\varepsilon^{-2})$ to achieve an $\varepsilon$-stationary solution, we introduce and analyze a new formulation termed Diffusion Stochastic Same-Sample Optimistic Gradient (DSS-OG). We prove its convergence and resolve the large batch issue by establishing a tighter upper bound, under the more general setting of nonconvex Polyak-Lojasiewicz (PL) risk functions. We also extend the applicability of the proposed method to the distributed scenario, where agents communicate with their neighbors via a left-stochastic protocol. To implement DSS-OG, we can query the stochastic gradient oracles in parallel with some extra memory overhead, resulting in a complexity comparable to its conventional counterpart. To demonstrate the efficacy of the proposed algorithm, we conduct tests by training generative adversarial networks.

Submitted 25 January, 2024; originally announced January 2024.

arXiv:2310.07983

Revisiting Decentralized ProxSkip: Achieving Linear Speedup

Authors: Luyao Guo, Sulaiman A. Alghunaim, Kun Yuan, Laurent Condat, Jinde Cao

Abstract: The ProxSkip algorithm for decentralized and federated learning is gaining increasing attention due to its proven benefits in accelerating communication complexity while maintaining robustness against data heterogeneity. However, existing analyses of ProxSkip are limited to the strongly convex setting and do not achieve linear speedup, where convergence performance increases linearly with respect to the number of nodes. So far, questions remain open about how ProxSkip behaves in the non-convex setting and whether linear speedup is achievable. In this paper, we revisit decentralized ProxSkip and address both questions. We demonstrate that the leading communication complexity of ProxSkip is $\mathcal{O}\left(\frac{p\sigma^2}{n\varepsilon^2}\right)$ for non-convex and convex settings, and $\mathcal{O}\left(\frac{p\sigma^2}{n\varepsilon}\right)$ for the strongly convex setting, where $n$ represents the number of nodes, $p$ denotes the probability of communication, $\sigma^2$ signifies the level of stochastic noise, and $\varepsilon$ denotes the desired accuracy level. This result illustrates that ProxSkip achieves linear speedup and can asymptotically reduce communication overhead proportional to the probability of communication. Additionally, for the strongly convex setting, we further prove that ProxSkip can achieve linear speedup with network-independent stepsizes.

Submitted 19 April, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2302.00620

Local Exact-Diffusion for Decentralized Optimization and Learning

Authors: Sulaiman A. Alghunaim

Abstract: Distributed optimization methods with local updates have recently attracted a lot of attention due to their potential to reduce the communication cost of distributed methods. In these algorithms, a collection of nodes performs several local updates based on their local data, and then they communicate with each other to exchange estimate information. While there have been many studies on distributed local methods with centralized network connections, there has been less work on decentralized networks. In this work, we propose and investigate a locally updated decentralized method called Local Exact-Diffusion (LED). We establish the convergence of LED in both convex and nonconvex settings for the stochastic online setting. Our convergence rate improves over the rate of existing decentralized methods. When we specialize the network to the centralized case, we recover the state-of-the-art bound for centralized methods. We also link LED to several other independently studied distributed methods, including Scaffnew, FedGate, and VRL-SGD. Additionally, we numerically investigate the benefits of local updates for decentralized networks and demonstrate the effectiveness of the proposed method.

Submitted 10 October, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

arXiv:2301.02855

An Enhanced Gradient-Tracking Bound for Distributed Online Stochastic Convex Optimization

Authors: Sulaiman A. Alghunaim, Kun Yuan

Abstract: Gradient-tracking (GT) based decentralized methods have emerged as an effective and viable alternative method to decentralized (stochastic) gradient descent (DSGD) when solving distributed online stochastic optimization problems. Initial studies of GT methods implied that GT methods have worse network dependent rate than DSGD, contradicting experimental results. This dilemma has recently been resolved, and tighter rates for GT methods have been established, which improves upon DSGD. In this work, we establish more enhanced rates for GT methods under the online stochastic convex settings. We present an alternative approach for analyzing GT methods for convex problems and over static graphs. When compared to previous analyses, this approach allows us to establish enhanced network dependent rates.

Submitted 7 January, 2023; originally announced January 2023.

arXiv:2210.04757

On the Performance of Gradient Tracking with Local Updates

Authors: Edward Duc Hien Nguyen, Sulaiman A. Alghunaim, Kun Yuan, César A. Uribe

Abstract: We study the decentralized optimization problem where a network of $n$ agents seeks to minimize the average of a set of heterogeneous non-convex cost functions distributedly. State-of-the-art decentralized algorithms like Exact Diffusion (ED) and Gradient Tracking (GT) involve communicating every iteration. However, communication is expensive, resource intensive, and slow. In this work, we analyze a locally updated GT method (LU-GT), where agents perform local recursions before interacting with their neighbors. While local updates have been shown to reduce communication overhead in practice, their theoretical influence has not been fully characterized. We show LU-GT has the same communication complexity as the Federated Learning setting but allows arbitrary network topologies. In addition, we prove that the number of local updates does not degrade the quality of the solution achieved by LU-GT. Numerical examples reveal that local updates can lower communication costs in certain regimes (e.g., well-connected graphs).

Submitted 12 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

Comments: 8 pages, 1 figure, submitted to ACC

arXiv:2110.09993

doi 10.1109/TSP.2022.3184770

A Unified and Refined Convergence Analysis for Non-Convex Decentralized Learning

Authors: Sulaiman A. Alghunaim, Kun Yuan

Abstract: We study the consensus decentralized optimization problem where the objective function is the average of $n$ agents' private non-convex cost functions; moreover, the agents can only communicate with their neighbors on a given network topology. The stochastic learning setting is considered in this paper, where each agent can only access a noisy estimate of its gradient. Many decentralized methods can solve such problems, including EXTRA, Exact-Diffusion/D$^2$, and gradient-tracking. Unlike the famed DSGD algorithm, these methods have been shown to be robust to the heterogeneity across the local cost functions. However, the established convergence rates for these methods indicate that their sensitivity to the network topology is worse than DSGD. Such theoretical results imply that these methods can perform much worse than DSGD over sparse networks, which, however, contradicts empirical experiments where DSGD is observed to be more sensitive to the network topology. In this work, we study a general stochastic unified decentralized algorithm (SUDA) that includes the above methods as special cases. We establish the convergence of SUDA under both non-convex and the Polyak-Lojasiewicz condition settings. Our results provide improved network topology dependent bounds for these methods (such as Exact-Diffusion/D$^2$ and gradient-tracking) compared with existing literature. Moreover, our results show that these methods are often less sensitive to the network topology compared to DSGD, which agrees with numerical experiments.

Submitted 16 June, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

arXiv:2105.08023

Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD

Authors: Kun Yuan, Sulaiman A. Alghunaim, Xinmeng Huang

Abstract: We consider the decentralized stochastic optimization problems, where a network of $n$ nodes, each owning a local cost function, cooperate to find a minimizer of the globally-averaged cost. A widely studied decentralized algorithm for this problem is decentralized SGD (D-SGD), in which each node averages only with its neighbors. D-SGD is efficient in single-iteration communication, but it is very sensitive to the network topology. For smooth objective functions, the transient stage (which measures the number of iterations the algorithm has to experience before achieving the linear speedup stage) of D-SGD is on the order of $\Omega(n/(1-\beta)^2)$ and $\Omega(n^3/(1-\beta)^4)$ for strongly and generally convex cost functions, respectively, where $1-\beta \in (0,1)$ is a topology-dependent quantity that approaches $0$ for a large and sparse network. Hence, D-SGD suffers from slow convergence for large and sparse networks. In this work, we study the non-asymptotic convergence property of the D$^2$/Exact-diffusion algorithm. By eliminating the influence of data heterogeneity between nodes, D$^2$/Exact-diffusion is shown to have an enhanced transient stage that is on the order of $\tilde{\Omega}(n/(1-\beta))$ and $\Omega(n^3/(1-\beta)^2)$ for strongly and generally convex cost functions, respectively. Moreover, when D$^2$/Exact-diffusion is implemented with gradient accumulation and multi-round gossip communications, its transient stage can be further improved to $\tilde{\Omega}(1/(1-\beta)^{\frac{1}{2}})$ and $\tilde{\Omega}(n/(1-\beta))$ for strongly and generally convex cost functions, respectively. These established results for D$^2$/Exact-Diffusion have the best (i.e., weakest) dependence on network topology to our knowledge compared to existing decentralized algorithms. We also conduct numerical simulations to validate our theories.

Submitted 3 March, 2022; v1 submitted 17 May, 2021; originally announced May 2021.

arXiv:2006.08722

A Multi-Agent Primal-Dual Strategy for Composite Optimization over Distributed Features

Authors: Sulaiman A. Alghunaim, Ming Yan, Ali H. Sayed

Abstract: This work studies multi-agent sharing optimization problems with the objective function being the sum of smooth local functions plus a convex (possibly non-smooth) function coupling all agents. This scenario arises in many machine learning and engineering applications, such as regression over distributed features and resource allocation. We reformulate this problem into an equivalent saddle-point problem, which is amenable to decentralized solutions. We then propose a proximal primal-dual algorithm and establish its linear convergence to the optimal solution when the local functions are strongly-convex. To our knowledge, this is the first linearly convergent decentralized algorithm for multi-agent sharing problems with a general convex (possibly non-smooth) coupling function.

Submitted 15 June, 2020; originally announced June 2020.

Comments: To appear in European Signal Processing Conference (EUSIPCO) 2020

arXiv:1909.06479

Decentralized Proximal Gradient Algorithms with Linear Convergence Rates

Authors: Sulaiman A. Alghunaim, Ernest K. Ryu, Kun Yuan, Ali H. Sayed

Abstract: This work studies a class of non-smooth decentralized multi-agent optimization problems where the agents aim at minimizing a sum of local strongly-convex smooth components plus a common non-smooth term. We propose a general primal-dual algorithmic framework that unifies many existing state-of-the-art algorithms. We establish linear convergence of the proposed method to the exact solution in the presence of the non-smooth term. Moreover, for the more general class of problems with agent specific non-smooth terms, we show that linear convergence cannot be achieved (in the worst case) for the class of algorithms that uses the gradients and the proximal mappings of the smooth and non-smooth parts, respectively. We further provide a numerical counterexample that shows how some state-of-the-art algorithms fail to converge linearly for strongly-convex objectives and different local non-smooth terms.

Submitted 9 July, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: To appear in IEEE Transactions on Automatic Control

arXiv:1905.07996

A Linearly Convergent Proximal Gradient Algorithm for Decentralized Optimization

Authors: Sulaiman A. Alghunaim, Kun Yuan, Ali H. Sayed

Abstract: Decentralized optimization is a powerful paradigm that finds applications in engineering and learning design. This work studies decentralized composite optimization problems with non-smooth regularization terms. Most existing gradient-based proximal decentralized methods are known to converge to the optimal solution with sublinear rates, and it remains unclear whether this family of methods can achieve global linear convergence. To tackle this problem, this work assumes the non-smooth regularization term is common across all networked agents, which is the case for many machine learning problems. Under this condition, we design a proximal gradient decentralized algorithm whose fixed point coincides with the desired minimizer. We then provide a concise proof that establishes its linear convergence. In the absence of the non-smooth term, our analysis technique covers the well known EXTRA algorithm and provides useful bounds on the convergence rate and step-size.

Submitted 25 October, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: NeurIPS 2019

arXiv:1904.01196

Linear Convergence of Primal-Dual Gradient Methods and their Performance in Distributed Optimization

Authors: Sulaiman A. Alghunaim, Ali H. Sayed

Abstract: In this work, we revisit a classical incremental implementation of the primal-descent dual-ascent gradient method used for the solution of equality constrained optimization problems. We provide a short proof that establishes the linear (exponential) convergence of the algorithm for smooth strongly-convex cost functions and study its relation to the non-incremental implementation. We also study the effect of the augmented Lagrangian penalty term on the performance of distributed optimization algorithms for the minimization of aggregate cost functions over multi-agent networks.

Submitted 16 January, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

arXiv:1903.10956

doi 10.1109/TSP.2020.3008605

On the Influence of Bias-Correction on Distributed Stochastic Optimization

Authors: Kun Yuan, Sulaiman A. Alghunaim, Bicheng Ying, Ali H. Sayed

Abstract: Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed deterministic optimization problems. These methods employ constant step-sizes and converge linearly to the exact solution under proper conditions. However, their performance under stochastic and adaptive settings is less explored. It is still unknown whether, when, and why these bias-correction methods can outperform their traditional counterparts (such as consensus and diffusion) with noisy gradients and constant step-sizes. This work studies the performance of exact diffusion under the stochastic and adaptive setting, and provides conditions under which exact diffusion has better steady-state mean-square deviation (MSD) performance than traditional algorithms without bias-correction. In particular, it is proven that this superiority is more evident over sparsely-connected network topologies such as lines, cycles, or grids. Conditions are also provided under which the exact diffusion method matches or may even degrade the performance of traditional methods. Simulations are provided to validate the theoretical findings.

Submitted 11 July, 2019; v1 submitted 26 March, 2019; originally announced March 2019.

Comments: 17 pages, 9 figures, submitted for publication

arXiv:1810.02124

A Proximal Diffusion Strategy for Multi-Agent Optimization with Sparse Affine Constraints

Authors: Sulaiman A. Alghunaim, Kun Yuan, Ali H. Sayed

Abstract: This work develops a proximal primal-dual decentralized strategy for multi-agent optimization problems that involve multiple coupled affine constraints, where each constraint may involve only a subset of the agents. The constraints are generally sparse, meaning that only a small subset of the agents are involved in them. This scenario arises in many applications including decentralized control formulations, resource allocation problems, and smart grids. Traditional decentralized solutions tend to ignore the structure of the constraints and lead to degraded performance. We instead develop a decentralized solution that exploits the sparsity structure. Under constant step-size learning, the asymptotic convergence of the proposed algorithm is established in the presence of non-smooth terms, and it occurs at a linear rate in the smooth case. We also examine how the performance of the algorithm is influenced by the sparsity of the constraints. Simulations illustrate the superior performance of the proposed strategy.

Submitted 10 December, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: accepted for publication in IEEE TAC

arXiv:1712.08817

Distributed Coupled Multi-Agent Stochastic Optimization

Authors: Sulaiman A. Alghunaim, Ali H. Sayed

Abstract: This work develops effective distributed strategies for the solution of constrained multi-agent stochastic optimization problems with coupled parameters across the agents. In this formulation, each agent is influenced by only a subset of the entries of a global parameter vector or model, and is subject to convex constraints that are only known locally. Problems of this type arise in several applications, most notably in disease propagation models, minimum-cost flow problems, distributed control formulations, and distributed power system monitoring. This work focuses on stochastic settings, where a stochastic risk function is associated with each agent and the objective is to seek the minimizer of the aggregate sum of all risks subject to a set of constraints. Agents are not aware of the statistical distribution of the data and, therefore, can only rely on stochastic approximations in their learning strategies. We derive an effective distributed learning strategy that is able to track drifts in the underlying parameter model. A detailed performance and stability analysis is carried out showing that the resulting coupled diffusion strategy converges at a linear rate to an $O(\mu)$-neighborhood of the true penalized optimizer.

Submitted 13 March, 2019; v1 submitted 23 December, 2017; originally announced December 2017.

Showing 1–15 of 15 results for author: Alghunaim, S A