Federated Composite Saddle Point Optimization

Abstract: Federated learning (FL) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm, which is the first of its kind that encompasses both saddle point optimization and composite objectives under the FL paradigm. Both the convergence analysis and the empirical evaluation demonstrate the effectiveness of FeDualEx in these challenging settings. In addition, even for the sequential version of FeDualEx, we provide rates for the stochastic composite saddle point setting which, to our knowledge, are not found in prior literature.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2304.08389 [pdf, other]

Beyond first-order methods for non-convex non-concave min-max optimization

Authors: Abhijeet Vyas, Brian Bullins

Abstract: We propose a study of structured non-convex non-concave min-max problems which goes beyond standard first-order approaches. Inspired by the tight understanding established in recent works [Adil et al., 2022, Lin and Jordan, 2022b], we develop a suite of higher-order methods which show the improvements attainable beyond the monotone and Minty condition settings. Specifically, we provide a new understanding of the use of discrete-time $p^{th}$-order methods for operator norm minimization in the min-max setting, establishing an $O(1/ε^\frac{2}{p})$ rate to achieve $ε$-approximate stationarity, under the weakened Minty variational inequality condition of Diakonikolas et al. [2021]. We further present a continuous-time analysis alongside rates which match those for the discrete-time setting, and our empirical results highlight the practical benefits of our approach over first-order methods.

Submitted 17 April, 2023; originally announced April 2023.

arXiv:2205.06167 [pdf, ps, other]

Optimal Methods for Higher-Order Smooth Monotone Variational Inequalities

Authors: Deeksha Adil, Brian Bullins, Arun Jambulapati, Sushant Sachdeva

Abstract: In this work, we present new simple and optimal algorithms for solving the variational inequality (VI) problem for $p^{th}$-order smooth, monotone operators -- a problem that generalizes convex optimization and saddle-point problems. Recent works (Bullins and Lai (2020), Lin and Jordan (2021), Jiang and Mokhtari (2022)) present methods that achieve a rate of $\tilde{O}(ε^{-2/(p+1)})$ for $p\geq 1$, extending results by (Nemirovski (2004)) and (Monteiro and Svaiter (2012)) for $p=1,2$. A drawback to these approaches, however, is their reliance on a line search scheme. We provide the first $p^{\textrm{th}}$-order method that achieves a rate of $O(ε^{-2/(p+1)}).$ Our method does not rely on a line search routine, thereby improving upon previous rates by a logarithmic factor. Building on the Mirror Prox method of Nemirovski (2004), our algorithm works even in the constrained, non-Euclidean setting. Furthermore, we prove the optimality of our algorithm by constructing matching lower bounds. These are the first lower bounds for smooth MVIs beyond convex optimization for $p > 1$. This establishes a separation between solving smooth MVIs and smooth convex optimization, and settles the oracle complexity of solving $p^{\textrm{th}}$-order smooth MVIs.

Submitted 31 May, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: 21 Pages

arXiv:2110.02954 [pdf, other]

A Stochastic Newton Algorithm for Distributed Convex Optimization

Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence.

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2107.02432 [pdf, ps, other]

Unifying Width-Reduced Methods for Quasi-Self-Concordant Optimization

Authors: Deeksha Adil, Brian Bullins, Sushant Sachdeva

Abstract: We provide several algorithms for constrained optimization of a large class of convex problems, including softmax, $\ell_p$ regression, and logistic regression. Central to our approach is the notion of width reduction, a technique which has proven immensely useful in the context of maximum flow [Christiano et al., STOC'11] and, more recently, $\ell_p$ regression [Adil et al., SODA'19], in terms of improving the iteration complexity from $O(m^{1/2})$ to $\tilde{O}(m^{1/3})$, where $m$ is the number of rows of the design matrix, and where each iteration amounts to a linear system solve. However, a considerable drawback is that these methods require both problem-specific potentials and individually tailored analyses. As our main contribution, we initiate a new direction of study by presenting the first unified approach to achieving $m^{1/3}$-type rates. Notably, our method goes beyond these previously considered problems to more broadly capture quasi-self-concordant losses, a class which has recently generated much interest and includes the well-studied problem of logistic regression, among others. In order to do so, we develop a unified width reduction method for carefully handling these losses based on a more general set of potentials. Additionally, we directly achieve $m^{1/3}$-type rates in the constrained setting without the need for any explicit acceleration schemes, thus naturally complementing recent work based on a ball-oracle approach [Carmon et al., NeurIPS'20].

Submitted 6 July, 2021; originally announced July 2021.

arXiv:2102.01583 [pdf, other]

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm.

Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: 48 pages

arXiv:2007.04528 [pdf, ps, other]

Higher-order methods for convex-concave min-max optimization and monotone variational inequalities

Authors: Brian Bullins, Kevin A. Lai

Abstract: We provide improved convergence rates for constrained convex-concave min-max problems and monotone variational inequalities with higher-order smoothness. In min-max settings where the $p^{th}$-order derivatives are Lipschitz continuous, we give an algorithm HigherOrderMirrorProx that achieves an iteration complexity of $O(1/T^{\frac{p+1}{2}})$ when given access to an oracle for finding a fixed point of a $p^{th}$-order equation. We give analogous rates for the weak monotone variational inequality problem. For $p>2$, our results improve upon the iteration complexity of the first-order Mirror Prox method of Nemirovski [2004] and the second-order method of Monteiro and Svaiter [2012]. We further instantiate our entire algorithm in the unconstrained $p=2$ case.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2002.07839 [pdf, other]

Is Local SGD Better than Minibatch SGD?

Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee.

Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages

arXiv:1906.01621 [pdf, ps, other]

Higher-Order Accelerated Methods for Faster Non-Smooth Optimization

Authors: Brian Bullins, Richard Peng

Abstract: We provide improved convergence rates for various \emph{non-smooth} optimization problems via higher-order accelerated methods. In the case of $\ell_\infty$ regression, we achieve an $O(ε^{-4/5})$ iteration complexity, breaking the $O(ε^{-1})$ barrier so far present for previous methods. We arrive at a similar rate for the problem of $\ell_1$-SVM, going beyond what is attainable by first-order methods with prox-oracle access for non-smooth non-strongly convex problems. We further show how to achieve even faster rates by introducing higher-order regularization. Our results rely on recent advances in near-optimal accelerated methods for higher-order smooth convex optimization. In particular, we extend Nesterov's smoothing technique to show that the standard softmax approximation is not only smooth in the usual sense, but also \emph{higher-order} smooth. With this observation in hand, we provide the first example of higher-order acceleration techniques yielding faster rates for \emph{non-smooth} optimization, to the best of our knowledge.

Submitted 4 June, 2019; originally announced June 2019.

arXiv:1902.08721 [pdf, ps, other]

Online Control with Adversarial Disturbances

Authors: Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, Karan Singh

Abstract: We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as that of a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes upon previous work in two main aspects: our model allows for adversarial noise in the dynamics, and allows for general convex costs.

Submitted 22 February, 2019; originally announced February 2019.

arXiv:1812.10349 [pdf, ps, other]

Fast minimization of structured convex quartics

Authors: Brian Bullins

Abstract: We propose faster methods for unconstrained optimization of \emph{structured convex quartics}, which are convex functions of the form \begin{equation*} f(x) = c^\top x + x^\top \mathbf{G} x + \mathbf{T}[x,x,x] + \frac{1}{24} \mathopen\| \mathbf{A} x \mathclose\|_4^4 \end{equation*} for $c \in \mathbb{R}^d$, $\mathbf{G} \in \mathbb{R}^{d \times d}$, $\mathbf{T} \in \mathbb{R}^{d \times d \times d}$, and $\mathbf{A} \in \mathbb{R}^{n \times d}$ such that $\mathbf{A}^\top \mathbf{A} \succ 0$. In particular, we show how to achieve an $ε$-optimal minimizer for such functions with only $O(n^{1/5}\log^{O(1)}(\mathcal{Z}/ε))$ calls to a gradient oracle and linear system solver, where $\mathcal{Z}$ is a problem-dependent parameter. Our work extends recent ideas on efficient tensor methods and higher-order acceleration techniques to develop a descent method for optimizing the relevant quartic functions. As a natural consequence of our method, we achieve an overall cost of $O(n^{1/5}\log^{O(1)}(\mathcal{Z} / ε))$ calls to a gradient oracle and (sparse) linear system solver for the problem of $\ell_4$-regression when $\mathbf{A}^\top \mathbf{A} \succ 0$, providing additional insight into what may be achieved for general $\ell_p$-regression. Our results show the benefit of combining efficient higher-order methods with recent acceleration techniques for improving convergence rates in fundamental convex optimization problems.

Submitted 26 December, 2018; originally announced December 2018.

arXiv:1806.02958 [pdf, other]

Efficient Full-Matrix Adaptive Regularization

Authors: Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, Yi Zhang

Abstract: Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Due to the large number of parameters of machine learning problems, full-matrix preconditioning methods are prohibitively expensive. We show how to modify full-matrix adaptive regularization in order to make it practical and effective. We also provide a novel theoretical analysis for adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, consists of the efficient computation of the inverse square root of a low-rank matrix. Our preliminary experiments show improved iteration-wise convergence rates across synthetic tasks and standard deep learning benchmarks, and that the more carefully-preconditioned steps sometimes lead to a better solution.

Submitted 17 November, 2020; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: Updated to ICML 2019 camera-ready version. Title of preprint was "The Case for Full-Matrix Adaptive Regularization"

arXiv:1806.00065 [pdf, other]

doi 10.1007/s10107-020-01505-1

Adaptive regularization with cubics on manifolds

Authors: Naman Agarwal, Nicolas Boumal, Brian Bullins, Coralia Cartis

Abstract: Adaptive regularization with cubics (ARC) is an algorithm for unconstrained, non-convex optimization. Akin to the popular trust-region method, its iterations can be thought of as approximate, safe-guarded Newton steps. For cost functions with Lipschitz continuous Hessian, ARC has optimal iteration complexity, in the sense that it produces an iterate with gradient smaller than $\varepsilon$ in $O(1/\varepsilon^{1.5})$ iterations. For the same price, it can also guarantee a Hessian with smallest eigenvalue larger than $-\varepsilon^{1/2}$. In this paper, we study a generalization of ARC to optimization on Riemannian manifolds. In particular, we generalize the iteration complexity results to this richer framework. Our central contribution lies in the identification of appropriate manifold-specific assumptions that allow us to secure these complexity guarantees both when using the exponential map and when using a general retraction. A substantial part of the paper is devoted to studying these assumptions---relevant beyond ARC---and providing user-friendly sufficient conditions for them. Numerical experiments are encouraging.

Submitted 16 May, 2020; v1 submitted 31 May, 2018; originally announced June 2018.

Comments: 48 pages, 3 figures

arXiv:1611.01146 [pdf, other]

Finding Approximate Local Minima Faster than Gradient Descent

Authors: Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, Tengyu Ma

Abstract: We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

Submitted 24 April, 2017; v1 submitted 3 November, 2016; originally announced November 2016.

arXiv:1305.2147 [pdf, ps, other]

When the largest eigenvalue of the modularity and normalized modularity matrix is zero

Authors: Marianna Bolla, Brian Bullins, Sorathan Chaturapruek, Shiwen Chen, Katalin Friedl

Abstract: In July 2012, at the Conference on Applications of Graph Spectra in Computer Science, Barcelona, D. Stevanovic posed the following open problem: which graphs have zero as the largest eigenvalue of their modularity matrix? The conjecture was that only the complete and complete multipartite graphs do. They indeed have this property, but are they the only ones? In this paper, we will give an affirmative answer to this question and prove a bit more: both the modularity and the normalized modularity matrix of a graph are negative semidefinite if and only if the graph is complete or complete multipartite.

Submitted 9 May, 2013; originally announced May 2013.

MSC Class: 05C35; 62H30

Showing 1–15 of 15 results for author: Bullins, B