Search | arXiv e-print repository

Hidden Minima in Two-Layer ReLU Networks

Abstract: The optimization problem associated to fitting two-layer ReLU networks having $d$~inputs, $k$~neurons, and labels generated by a target network, is considered. Two types of infinite families of spurious minima, giving one minimum per $d$, were recently found. The loss at minima belonging to the first type converges to zero as $d$ increases. In the second type, the loss remains bounded away from zero. That being so, how may one avoid minima belonging to the latter type? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of hidden minima. By existing analyses, the Hessian spectra of both types agree modulo $O(d^{-1/2})$-terms -- not promising. Thus, rather, our investigation proceeds by studying curves along which the loss is minimized or maximized, generally referred to as tangency arcs. We prove that apparently far removed group representation-theoretic considerations concerning the arrangement of subspaces invariant to the action of subgroups of $S_d$, the symmetric group on $d$ symbols, relative to ones fixed by the action yield a precise description of all finitely many admissible types of tangency arcs. The general results used for the loss function reveal that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent in previous work, indicating in particular the subtlety of the analysis.
The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically sufficiently tame to enable a numerical construction of tangency arcs and so compare how minima of both types are positioned relative to adjacent critical points.

Submitted 19 February, 2024; v1 submitted 27 December, 2023; originally announced December 2023.

arXiv:2306.07886 [pdf, ps, other]

Symmetry & Critical Points for Symmetric Tensor Decomposition Problems

Authors: Yossi Arjevani, Gal Vinograd

Abstract: We consider the nonconvex optimization problem associated with the decomposition of a real symmetric tensor into a sum of rank one terms. Use is made of the rich symmetry structure to construct infinite families of critical points represented by Puiseux series in the problem dimension, and so obtain precise analytic estimates on the value of the objective function and the Hessian spectrum. The results allow an analytic characterization of various obstructions to using local optimization methods, revealing in particular a complex array of saddles and minima differing by their symmetry, structure and analytic properties. A~desirable phenomenon, occurring for all critical points considered, concerns the number of negative Hessian eigenvalues increasing with the value of the objective function. Our approach makes use of Newton polyhedra as well as results from real algebraic geometry, notably the Curve Selection Lemma, to determine the extremal character of degenerate critical points, establishing in particular the existence of infinite families of third-order saddles which can significantly slow down the optimization process.

Submitted 7 August, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2210.06088 [pdf, ps, other]

Annihilation of Spurious Minima in Two-Layer ReLU Networks

Authors: Yossi Arjevani, Michael Field

Abstract: We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. Use is made of the rich symmetry structure to develop a novel set of tools for studying the mechanism by which over-parameterization annihilates spurious minima. Sharp analytic estimates are obtained for the loss and the Hessian spectrum at different minima, and it is proved that adding neurons can turn symmetric spurious minima into saddles; minima of lesser symmetry require more neurons. Using Cauchy's interlacing theorem, we prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function. This analytic approach uses techniques, new to the field, from algebraic geometry, representation theory and symmetry breaking, and confirms rigorously the effectiveness of over-parameterization in making the associated loss landscape accessible to gradient-based methods. For a fixed number of neurons and inputs, the spectral results remain true under symmetry breaking perturbation of the target.

Submitted 12 October, 2022; originally announced October 2022.

arXiv:2107.10370 [pdf, other]

Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks: A Tale of Symmetry II

Authors: Yossi Arjevani, Michael Field

Abstract: We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. We make use of the rich symmetry structure to develop a novel set of tools for studying families of spurious minima. In contrast to existing approaches which operate in limiting regimes, our technique directly addresses the nonconvex loss landscape for a finite number of inputs $d$ and neurons $k$, and provides analytic, rather than heuristic, information. In particular, we derive analytic estimates for the loss at different minima, and prove that modulo $O(d^{-1/2})$-terms the Hessian spectrum concentrates near small positive constants, with the exception of $Θ(d)$ eigenvalues which grow linearly with~$d$. We further show that the Hessian spectrum at global and spurious minima coincide to $O(d^{-1/2})$-order, thus challenging our ability to argue about statistical generalization through local curvature. Lastly, our technique provides the exact \emph{fractional} dimensionality at which families of critical points turn from saddles into spurious minima. This makes possible the study of the creation and the annihilation of spurious minima using powerful tools from equivariant bifurcation theory.

Submitted 17 October, 2021; v1 submitted 21 July, 2021; originally announced July 2021.

Comments: arXiv admin note: text overlap with arXiv:2008.01805

arXiv:2107.02422 [pdf, ps, other]

doi 10.1088/1361-6544/ac619f

Equivariant bifurcation, quadratic equivariants, and symmetry breaking for the standard representation of $S_n$

Authors: Yossi Arjevani, Michael Field

Abstract: Motivated by questions originating from the study of a class of shallow student-teacher neural networks, methods are developed for the analysis of spurious minima in classes of gradient equivariant dynamics related to neural nets. In the symmetric case, methods depend on the generic equivariant bifurcation theory of irreducible representations of the symmetric group on $n$ symbols, $S_n$; in particular, the standard representation of $S_n$. It is shown that spurious minima do not arise from spontaneous symmetry breaking but rather through a complex deformation of the landscape geometry that can be encoded by a generic $S_n$-equivariant bifurcation. We describe minimal models for forced symmetry breaking that give a lower bound on the dynamic complexity involved in the creation of spurious minima when there is no symmetry. Results on generic bifurcation when there are quadratic equivariants are also proved; this work extends and clarifies results of Ihrig & Golubitsky and Chossat, Lauterbach & Melbourne on the instability of solutions when there are quadratic equivariants.

Submitted 6 July, 2021; originally announced July 2021.

arXiv:2103.06234 [pdf, other]

Symmetry Breaking in Symmetric Tensor Decomposition

Authors: Yossi Arjevani, Joan Bruna, Michael Field, Joe Kileel, Matthew Trager, Francis Williams

Abstract: In this note, we consider the highly nonconvex optimization problem associated with computing the rank decomposition of symmetric tensors. We formulate the invariance properties of the loss function and show that critical points detected by standard gradient based methods are \emph{symmetry breaking} with respect to the target tensor. The phenomena, seen for different choices of target tensors and norms, make possible the use of recently developed analytic and algebraic tools for studying nonconvex optimization landscapes exhibiting symmetry breaking phenomena of similar nature.

Submitted 28 December, 2023; v1 submitted 10 March, 2021; originally announced March 2021.

arXiv:2008.01805 [pdf, other]

Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry

Authors: Yossi Arjevani, Michael Field

Abstract: We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are generated by a target network. We leverage the rich symmetry structure to analytically characterize the Hessian at various families of spurious minima in the natural regime where the number of inputs $d$ and the number of hidden neurons $k$ are finite. In particular, we prove that for $d\ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b) $Ω(d)$ of the eigenvalues grow linearly with $k$. Although this phenomenon of extremely skewed spectrum has been observed many times before, to our knowledge, this is the first time it has been established rigorously. Our analytic approach uses techniques, new to the field, from symmetry breaking and representation theory, and carries important implications for our ability to argue about statistical generalization through local curvature.

Submitted 15 October, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

arXiv:2006.13476 [pdf, other]

Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations

Authors: Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, Karthik Sridharan

Abstract: We design an algorithm which finds an $ε$-approximate stationary point (with $\|\nabla F(x)\|\le ε$) using $O(ε^{-3})$ stochastic gradient and Hessian-vector products, matching guarantees that were previously available only under a stronger assumption of access to multiple queries with the same random seed. We prove a lower bound which establishes that this rate is optimal and---surprisingly---that it cannot be improved using stochastic $p$th order methods for any $p\ge 2$, even when the first $p$ derivatives of the objective are Lipschitz. Together, these results characterize the complexity of non-convex stochastic optimization with second-order methods and beyond. Expanding our scope to the oracle complexity of finding $(ε,γ)$-approximate second-order stationary points, we establish nearly matching upper and lower bounds for stochastic second-order methods. Our lower bounds here are novel even in the noiseless case.

Submitted 24 June, 2020; originally announced June 2020.

Comments: Accepted to CONFERENCE ON LEARNING THEORY (COLT) 2020

arXiv:2006.06733 [pdf, other]

IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

Authors: Yossi Arjevani, Joan Bruna, Bugra Can, Mert Gürbüzbalaban, Stefanie Jegelka, Hongzhou Lin

Abstract: We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA arXiv:1404.6264 and SSDA arXiv:1702.08704. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.

Submitted 11 June, 2020; originally announced June 2020.

arXiv:2003.10576 [pdf, ps, other]

doi 10.1016/j.physd.2021.133014

Symmetry & critical points for a model shallow neural network

Authors: Yossi Arjevani, Michael Field

Abstract: We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ hidden neurons, where labels are assumed to be generated by a (teacher) neural network. We leverage the rich symmetry exhibited by such models to identify various families of critical points and express them as power series in $k^{-\frac{1}{2}}$. These expressions are then used to derive estimates for several related quantities which imply that not all spurious minima are alike. In particular, we show that while the loss function at certain types of spurious minima decays to zero like $k^{-1}$, in other cases the loss converges to a strictly positive constant. The methods used depend on symmetry, the geometry of group actions, bifurcation, and Artin's implicit function theorem.

Submitted 11 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

arXiv:2002.03273 [pdf, ps, other]

On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

Authors: Yossi Arjevani, Amit Daniely, Stefanie Jegelka, Hongzhou Lin

Abstract: Recent advances in randomized incremental methods for minimizing $L$-smooth $μ$-strongly convex finite sums have culminated in tight complexity of $\tilde{O}((n+\sqrt{n L/μ})\log(1/ε))$ and $O(n+\sqrt{nL/ε})$, where $μ>0$ and $μ=0$, respectively, and $n$ denotes the number of individual functions. Unlike incremental methods, stochastic methods for finite sums do not rely on an explicit knowledge of which individual function is being addressed at each iteration, and as such, must perform at least $Ω(n^2)$ iterations to obtain $O(1/n^2)$-optimal solutions. In this work, we exploit the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight. Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible with stochastic oracles}, and achieves complexity bounds of $\tilde{O}((n^2+n\sqrt{L/μ})\log(1/ε))$ and $O(n\sqrt{L/ε})$, for $μ>0$ and $μ=0$, respectively. Our bounds hold w.h.p. and match in part existing lower bounds of $\tildeΩ(n^2+\sqrt{nL/μ}\log(1/ε))$ and $\tildeΩ(n^2+\sqrt{nL/ε})$, for $μ>0$ and $μ=0$, respectively.

Submitted 8 February, 2020; originally announced February 2020.

arXiv:1912.11939 [pdf, other]

On the Principle of Least Symmetry Breaking in Shallow ReLU Models

Authors: Yossi Arjevani, Michael Field

Abstract: We consider the optimization problem associated with fitting two-layer ReLU networks with respect to the squared loss, where labels are assumed to be generated by a target network. Focusing first on standard Gaussian inputs, we show that the structure of spurious local minima detected by stochastic gradient descent (SGD) is, in a well-defined sense, the \emph{least loss of symmetry} with respect to the target weights. A closer look at the analysis indicates that this principle of least symmetry breaking may apply to a broader range of settings. Motivated by this, we conduct a series of experiments which corroborate this hypothesis for different classes of non-isotropic non-product distributions, smooth activation functions and networks with a few layers.
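The symmetry referred to in this abstract can be illustrated with a minimal sketch. The modelling choices here are assumptions, not taken from the listing: second-layer weights are fixed to one, inputs are standard Gaussian, the target weight matrix is the identity, and the expectation $E[\mathrm{ReLU}(w\cdot x)\,\mathrm{ReLU}(v\cdot x)]$ is evaluated with the standard arc-cosine closed form. The function names are hypothetical.

```python
import math

def relu_pair_expectation(w, v):
    """E[relu(w.x) * relu(v.x)] for x ~ N(0, I), via the arc-cosine closed form."""
    nw = math.sqrt(sum(a * a for a in w))
    nv = math.sqrt(sum(a * a for a in v))
    if nw == 0.0 or nv == 0.0:
        return 0.0
    cos_t = sum(a * b for a, b in zip(w, v)) / (nw * nv)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding outside [-1, 1]
    theta = math.acos(cos_t)
    return nw * nv * (math.sin(theta) + (math.pi - theta) * cos_t) / (2.0 * math.pi)

def population_loss(W, V):
    """0.5 * E[(sum_i relu(w_i.x) - sum_j relu(v_j.x))^2]; rows of W and V are neurons."""
    ww = sum(relu_pair_expectation(a, b) for a in W for b in W)
    wv = sum(relu_pair_expectation(a, b) for a in W for b in V)
    vv = sum(relu_pair_expectation(a, b) for a in V for b in V)
    return 0.5 * (ww - 2.0 * wv + vv)

def identity(d):
    """Identity target weights (an assumed choice of target network)."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def permute(W, row_perm, col_perm):
    """Reorder the neurons (rows) and the input coordinates (columns) of W."""
    return [[W[r][c] for c in col_perm] for r in row_perm]

if __name__ == "__main__":
    V = identity(3)
    W = [[1.0, 0.2, 0.0], [0.1, 0.9, -0.3], [0.0, 0.4, 1.2]]
    # The two printed values should agree up to floating-point rounding.
    print(population_loss(W, V))
    print(population_loss(permute(W, [2, 0, 1], [1, 2, 0]), V))
```

Under these assumptions the loss is unchanged when the rows and columns of `W` are permuted independently, which is the kind of invariance against which "loss of symmetry" at a minimum can be measured.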

Submitted 28 December, 2023; v1 submitted 26 December, 2019; originally announced December 2019.

arXiv:1912.02365 [pdf, other]

Lower Bounds for Non-Convex Stochastic Optimization

Authors: Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, Blake Woodworth

Abstract: We lower bound the complexity of finding $ε$-stationary points (with gradient norm at most $ε$) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least $ε^{-4}$ queries to find an $ε$-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of $ε^{-3}$ queries, establishing the optimality of recently proposed variance reduction techniques.

Submitted 27 February, 2022; v1 submitted 4 December, 2019; originally announced December 2019.

Comments: Correction to hard instance dimensions in Theorem 3

arXiv:1806.10188 [pdf, ps, other]

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Authors: Yossi Arjevani, Ohad Shamir, Nathan Srebro

Abstract: We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $τ$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent run on only $1/τ$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions.
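The delayed update described in this abstract can be sketched on a one-dimensional quadratic. This is only an illustration of the noiseless update rule $x_{t+1} = x_t - η f'(x_{t-τ})$, not the paper's analysis. The function name, the choice $f(x) = \tfrac{1}{2} a x^2$, and the convention that iterates before time 0 equal the starting point are all assumptions.

```python
def delayed_gradient_descent(curvature, step_size, delay, num_steps, x0=1.0):
    """Run x_{t+1} = x_t - step_size * f'(x_{t-delay}) on f(x) = 0.5 * curvature * x**2.

    Iterates before time 0 are taken to equal x0 (an illustrative convention).
    Returns the list of iterates x_0, ..., x_{num_steps}.
    """
    history = [x0] * (delay + 1)  # history[0] is x_{t-delay}, history[-1] is x_t
    trajectory = [x0]
    for _ in range(num_steps):
        stale_gradient = curvature * history[0]  # gradient at the delayed iterate
        x_next = history[-1] - step_size * stale_gradient
        history = history[1:] + [x_next]
        trajectory.append(x_next)
    return trajectory

if __name__ == "__main__":
    # Same step size, with and without delay; then a smaller step size with delay.
    for step_size, delay in [(1.0, 0), (1.0, 5), (0.1, 5)]:
        traj = delayed_gradient_descent(1.0, step_size, delay, num_steps=100)
        print(step_size, delay, max(abs(x) for x in traj), abs(traj[-1]))
```

In this toy setting a step size that is fine without delay makes the delayed iteration oscillate with growing amplitude, while a sufficiently smaller step size restores convergence, which is one way to see why delays cost iterations in the noiseless case.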

Submitted 26 June, 2018; originally announced June 2018.

arXiv:1706.01686 [pdf, ps, other]

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

Authors: Yossi Arjevani

Abstract: We study the conditions under which one is able to efficiently apply variance-reduction and acceleration schemes on finite sum optimization problems. First, we show that, perhaps surprisingly, the finite sum structure by itself, is not sufficient for obtaining a complexity bound of $\tilde{\mathcal{O}}((n+L/μ)\ln(1/ε))$ for $L$-smooth and $μ$-strongly convex individual functions - one must also know which individual function is being referred to by the oracle at each iteration. Next, we show that for a broad class of first-order and coordinate-descent finite sum algorithms (including, e.g., SDCA, SVRG, SAG), it is not possible to get an `accelerated' complexity bound of $\tilde{\mathcal{O}}((n+\sqrt{n L/μ})\ln(1/ε))$, unless the strong convexity parameter is given explicitly. Lastly, we show that when this class of algorithms is used for minimizing $L$-smooth and convex finite sums, the optimal complexity bound is $\tilde{\mathcal{O}}(n+L/ε)$, assuming that (on average) the same update rule is used in every iteration, and $\tilde{\mathcal{O}}(n+\sqrt{nL/ε})$, otherwise.

Submitted 6 December, 2017; v1 submitted 6 June, 2017; originally announced June 2017.

arXiv:1705.07260 [pdf, ps, other]

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Authors: Yossi Arjevani, Ohad Shamir, Ron Shiff

Abstract: Second-order methods, which utilize gradients as well as Hessians to optimize a given function, are of major importance in mathematical optimization. In this work, we prove tight bounds on the oracle complexity of such methods for smooth convex functions, or equivalently, the worst-case number of iterations required to optimize such functions to a given accuracy. In particular, these bounds indicate when such methods can or cannot improve on gradient-based methods, whose oracle complexity is much better understood. We also provide generalizations of our results to higher-order methods.

Submitted 17 August, 2017; v1 submitted 20 May, 2017; originally announced May 2017.

Comments: 35 pages; Added discussion of matching upper bounds, and generalization to higher-order methods

arXiv:1611.04982 [pdf, ps, other]

Oracle Complexity of Second-Order Methods for Finite-Sum Problems

Authors: Yossi Arjevani, Ohad Shamir

Abstract: Finite-sum optimization problems are ubiquitous in machine learning, and are commonly solved using first-order methods which rely on gradient computations. Recently, there has been growing interest in \emph{second-order} methods, which rely on both gradients and Hessians. In principle, second-order methods can require much fewer iterations than first-order methods, and hold the promise for more efficient algorithms. Although computing and manipulating Hessians is prohibitive for high-dimensional problems in general, the Hessians of individual functions in finite-sum problems can often be efficiently computed, e.g. because they possess a low-rank structure. Can second-order information indeed be used to solve such problems more efficiently? In this paper, we provide evidence that the answer -- perhaps surprisingly -- is negative, at least in terms of worst-case guarantees. However, we also discuss what additional assumptions and algorithmic approaches might potentially circumvent this negative result.

Submitted 8 March, 2017; v1 submitted 15 November, 2016; originally announced November 2016.

Comments: 30 pages

arXiv:1606.09333 [pdf, other]

Dimension-Free Iteration Complexity of Finite Sum Optimization Problems

Authors: Yossi Arjevani, Ohad Shamir

Abstract: Many canonical machine learning problems boil down to a convex optimization problem with a finite sum structure. However, whereas much progress has been made in developing faster algorithms for this setting, the inherent limitations of these problems are not satisfactorily addressed by existing lower bounds. Indeed, current bounds focus on first-order optimization algorithms, and only apply in the often unrealistic regime where the number of iterations is less than $\mathcal{O}(d/n)$ (where $d$ is the dimension and $n$ is the number of samples). In this work, we extend the framework of (Arjevani et al., 2015) to provide new lower bounds, which are dimension-free, and go beyond the assumptions of current bounds, thereby covering standard finite sum optimization methods, e.g., SAG, SAGA, SVRG, SDCA without duality, as well as stochastic coordinate-descent methods, such as SDCA and accelerated proximal SDCA.

Submitted 29 June, 2016; originally announced June 2016.

arXiv:1605.03529 [pdf, ps, other]

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

Authors: Yossi Arjevani, Ohad Shamir

Abstract: We consider a broad class of first-order optimization algorithms which are \emph{oblivious}, in the sense that their step sizes are scheduled regardless of the function under consideration, except for limited side-information such as smoothness or strong convexity parameters. With the knowledge of these two parameters, we show that any such algorithm attains an iteration complexity lower bound of $Ω(\sqrt{L/ε})$ for $L$-smooth convex functions, and $\tildeΩ(\sqrt{L/μ}\ln(1/ε))$ for $L$-smooth $μ$-strongly convex functions. These lower bounds are stronger than those in the traditional oracle model, as they hold independently of the dimension. To attain these, we abandon the oracle model in favor of a structure-based approach which builds upon a framework recently proposed in (Arjevani et al., 2015). We further show that without knowing the strong convexity parameter, it is impossible to attain an iteration complexity better than $\tildeΩ\left((L/μ)\ln(1/ε)\right)$. This result is then used to formalize an observation regarding $L$-smooth convex functions, namely, that the iteration complexity of algorithms employing time-invariant step sizes must be at least $Ω(L/ε)$.

Submitted 11 May, 2016; originally announced May 2016.

arXiv:1506.01900 [pdf, ps, other]

Communication Complexity of Distributed Convex Learning and Optimization

Authors: Yossi Arjevani, Ohad Shamir

Abstract: We study the fundamental limits to communication-efficient distributed methods for convex learning and optimization, under different assumptions on the information available to individual machines, and the types of functions considered. We identify cases where existing algorithms are already worst-case optimal, as well as cases where room for further improvement is still possible. Among other things, our results indicate that without similarity between the local objective functions (due to statistical data similarity or otherwise) many communication rounds may be required, even if the machines have unbounded computational power.

Submitted 28 October, 2015; v1 submitted 5 June, 2015; originally announced June 2015.

arXiv:1503.06833 [pdf, other]

On Lower and Upper Bounds for Smooth and Strongly Convex Optimization Problems

Authors: Yossi Arjevani, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We develop a novel framework to study smooth and strongly convex optimization algorithms, both deterministic and stochastic. Focusing on quadratic functions we are able to examine optimization algorithms as a recursive application of linear operators. This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials whereby new lower and upper bounds are derived. Whereas existing lower bounds for this setting are only valid when the dimensionality scales with the number of iterations, our lower bound holds in the natural regime where the dimensionality is fixed. Lastly, expressing it as an optimal solution for the corresponding optimization problem over polynomials, as formulated by our framework, we present a novel systematic derivation of Nesterov's well-known Accelerated Gradient Descent method. This rather natural interpretation of AGD contrasts with earlier ones which lacked a simple, yet solid, motivation.

Submitted 23 March, 2015; originally announced March 2015.
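
Illustration (not from the paper): the abstract above describes viewing optimization algorithms on quadratics as a recursive application of linear operators. A minimal sketch of that idea for plain gradient descent, assuming the quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$ with a hypothetical $2 \times 2$ matrix `A`, vector `b`, and step size `eta` chosen only for this example: the error $x_k - x^*$ evolves as $(I - \eta A)^k (x_0 - x^*)$. The function names are illustrative and do not come from the listed work.

```python
def gd_step(x, A, b, eta):
    """One gradient-descent step for f(x) = 0.5 * x^T A x - b^T x."""
    n = len(x)
    grad = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    return [x[i] - eta * grad[i] for i in range(n)]


def apply_operator(e, A, eta):
    """Apply the linear operator (I - eta * A) to an error vector e."""
    n = len(e)
    return [e[i] - eta * sum(A[i][j] * e[j] for j in range(n)) for i in range(n)]


def run(steps=10):
    # Hypothetical example data: eigenvalues of A are mu = 1 and L = 4.
    A = [[4.0, 0.0], [0.0, 1.0]]
    b = [4.0, 1.0]
    x_star = [1.0, 1.0]          # minimizer, since A x* = b
    eta = 0.25                   # step size 1/L

    x = [0.0, 0.0]
    e = [x[i] - x_star[i] for i in range(2)]   # initial error x_0 - x*
    for _ in range(steps):
        x = gd_step(x, A, b, eta)              # iterate the algorithm
        e = apply_operator(e, A, eta)          # iterate the linear operator
    return x, e, x_star


x, e, x_star = run()
# The algorithm's error and the operator recursion agree coordinate-wise.
print([x[i] - x_star[i] for i in range(2)], e)
```

In this sketch the error along the eigenvalue-1 direction shrinks by a factor of $1 - \eta\mu = 0.75$ per step, which is the kind of spectral quantity that such an operator view makes explicit.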

arXiv:1410.6387 [pdf, other]

On Lower and Upper Bounds in Smooth Strongly Convex Optimization - A Unified Approach via Linear Iterative Methods

Authors: Yossi Arjevani

Abstract: In this thesis we develop a novel framework to study smooth and strongly convex optimization algorithms, both deterministic and stochastic. Focusing on quadratic functions we are able to examine optimization algorithms as a recursive application of linear operators. This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials whereby new lower and upper bounds are derived. In particular, we present a new and natural derivation of Nesterov's well-known Accelerated Gradient Descent method by employing simple 'economic' polynomials. This rather natural interpretation of AGD contrasts with earlier ones which lacked a simple, yet solid, motivation. Lastly, whereas existing lower bounds are only valid when the dimensionality scales with the number of iterations, our lower bound holds in the natural regime where the dimensionality is fixed.

Submitted 23 October, 2014; originally announced October 2014.

Comments: A related paper co-authored with Shai Shalev-Shwartz and Ohad Shamir is to be published soon

Showing 1–22 of 22 results for author: Arjevani, Y