-
First-Order Methods for Linearly Constrained Bilevel Optimization
Authors:
Guy Kornowski,
Swati Padmanabhan,
Kai Wang,
Zhe Zhang,
Suvrit Sra
Abstract:
Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality constraints, we attain $ε$-stationarity in $\widetilde{O}(ε^{-2})$ gradient oracle calls, which is nearly-optimal. For linear inequality constraints, we attain $(δ,ε)$-Goldstein stationarity in $\widetilde{O}(d{δ^{-1} ε^{-3}})$ gradient oracle calls, where $d$ is the upper-level dimension. Finally, we obtain for the linear inequality setting dimension-free rates of $\widetilde{O}({δ^{-1} ε^{-4}})$ oracle complexity under the additional assumption of oracle access to the optimal dual variable. Along the way, we develop new nonsmooth nonconvex optimization methods with inexact oracles. We verify these guarantees with preliminary numerical experiments.
Submitted 18 June, 2024;
originally announced June 2024.
-
Riemannian Bilevel Optimization
Authors:
Sanchayan Dutta,
Xiang Cheng,
Suvrit Sra
Abstract:
We develop new algorithms for Riemannian bilevel optimization. We focus in particular on batch and stochastic gradient-based methods, with the explicit goal of avoiding second-order information such as Riemannian hyper-gradients. We propose and analyze $\mathrm{RF^2SA}$, a method that leverages first-order gradient information to navigate the complex geometry of Riemannian manifolds efficiently. Notably, $\mathrm{RF^2SA}$ is a single-loop algorithm, and thus easier to implement and use. Under various setups, including stochastic optimization, we provide explicit convergence rates for reaching $ε$-stationary points. We also address the challenge of optimizing over Riemannian manifolds with constraints by adjusting the multiplier in the Lagrangian, ensuring convergence to the desired solution without requiring access to second-order derivatives.
Submitted 22 May, 2024;
originally announced May 2024.
-
Efficient Sampling on Riemannian Manifolds via Langevin MCMC
Authors:
Xiang Cheng,
Jingzhao Zhang,
Suvrit Sra
Abstract:
We study the task of efficiently sampling from a Gibbs distribution $d π^* = e^{-h} d {vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice. The key to our analysis of Langevin MCMC is a bound on the discretization error of the geometric Euler-Maruyama scheme, assuming $\nabla h$ is Lipschitz and $M$ has bounded sectional curvature. Our error bound matches the error of Euclidean Euler-Maruyama in terms of its stepsize dependence. Combined with a contraction guarantee for the geometric Langevin Diffusion under Kendall-Cranston coupling, we prove that the Langevin MCMC iterates lie within $ε$-Wasserstein distance of $π^*$ after $\tilde{O}(ε^{-2})$ steps, which matches the iteration complexity for Euclidean Langevin MCMC. Our results apply in general settings where $h$ can be nonconvex and $M$ can have negative Ricci curvature. Under additional assumptions that the Riemannian curvature tensor has bounded derivatives, and that $π^*$ satisfies a $CD(\cdot,\infty)$ condition, we analyze the stochastic gradient version of Langevin MCMC, and bound its iteration complexity by $\tilde{O}(ε^{-2})$ as well.
Submitted 15 February, 2024;
originally announced February 2024.
-
Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
Authors:
Xiang Cheng,
Yuxin Chen,
Suvrit Sra
Abstract:
Many neural network architectures are known to be Turing Complete, and can thus, in principle, implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. This paper provides theoretical and empirical evidence that (non-linear) Transformers naturally learn to implement gradient descent in function space, which in turn enables them to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Additionally, we show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.
Submitted 3 June, 2024; v1 submitted 11 December, 2023;
originally announced December 2023.
-
Linear attention is (maybe) all you need (to understand transformer optimization)
Authors:
Kwangjun Ahn,
Xiang Cheng,
Minhak Song,
Chulhee Yun,
Ali Jadbabaie,
Suvrit Sra
Abstract:
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
Submitted 13 March, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
Invex Programs: First Order Algorithms and Their Convergence
Authors:
Adarsh Barik,
Suvrit Sra,
Jean Honorio
Abstract:
Invex programs are a special class of non-convex problems that attain global minima at every stationary point. While classical first-order gradient descent methods can solve them, they converge very slowly. In this paper, we propose new first-order algorithms to solve the general class of invex problems. We identify sufficient conditions for convergence of our algorithms and provide rates of convergence. Furthermore, we go beyond unconstrained problems and provide a novel projected gradient method for constrained invex programs with convergence rate guarantees. We compare and contrast our results with existing first-order algorithms for a variety of unconstrained and constrained invex problems. To the best of our knowledge, our proposed algorithm is the first algorithm to solve constrained invex programs.
Submitted 10 July, 2023;
originally announced July 2023.
-
Transformers learn to implement preconditioned gradient descent for in-context learning
Authors:
Kwangjun Ahn,
Xiang Cheng,
Hadi Daneshmand,
Suvrit Sra
Abstract:
Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.
Submitted 9 November, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
How to escape sharp minima with random perturbations
Authors:
Kwangjun Ahn,
Ali Jadbabaie,
Suvrit Sra
Abstract:
Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.
Submitted 25 May, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
The Crucial Role of Normalization in Sharpness-Aware Minimization
Authors:
Yan Dai,
Kwangjun Ahn,
Suvrit Sra
Abstract:
Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.
Submitted 23 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
On the Training Instability of Shuffling SGD with Batch Normalization
Authors:
David X. Wu,
Chulhee Yun,
Suvrit Sra
Abstract:
We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.
Submitted 14 August, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
The Hornich-Hlawka functional inequality for functions with positive differences
Authors:
Constantin P. Niculescu,
Suvrit Sra
Abstract:
We analyze the role played by $n$-convexity for the fulfillment of a series of linear functional inequalities that extend the Hornich-Hlawka functional inequality, $f\left( x\right) +f\left( y\right) +f\left( z\right) +f\left( x+y+z\right) \geq f\left( x+y\right) +f\left( y+z\right)+f\left( z+x\right) +f(0),$ including extensions to the case of positive operators.
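As a small illustration of the inequality quoted in this abstract (not the paper's method), the sketch below evaluates the gap between its two sides for scalar arguments. The helper name `hlawka_gap` and the choice of test functions are assumptions made here for illustration: $f(t)=|t|$ is the classical Hornich-Hlawka case, $f(t)=t^2$ gives equality, and $f(t)=-|t|$ violates the inequality.

```python
import random

def hlawka_gap(f, x, y, z):
    """Left side minus right side of the Hornich-Hlawka functional inequality.

    The inequality holds for f at (x, y, z) exactly when this value is >= 0.
    """
    lhs = f(x) + f(y) + f(z) + f(x + y + z)
    rhs = f(x + y) + f(y + z) + f(z + x) + f(0)
    return lhs - rhs

# Classical case f = |.| at one integer point: (1+1+1+1) - (2+0+0+0) = 2.
gap_abs = hlawka_gap(abs, 1, 1, -1)

# f(t) = t^2 turns the inequality into an identity: 50 - 50 = 0 at (1, 2, 3).
gap_square = hlawka_gap(lambda t: t * t, 1, 2, 3)

# f(t) = -|t| flips the sign, so the inequality fails at the same point.
gap_neg_abs = hlawka_gap(lambda t: -abs(t), 1, 1, -1)

# Random spot check for f = |.| on real numbers (tolerance covers rounding).
random.seed(0)
samples = [tuple(random.uniform(-5, 5) for _ in range(3)) for _ in range(10000)]
min_gap = min(hlawka_gap(abs, *s) for s in samples)
```

A random spot check like this can only fail to find a counterexample; it is not a proof, and it says nothing about the operator extensions mentioned in the abstract.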
Submitted 19 January, 2023;
originally announced January 2023.
-
Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?
Authors:
Yi Tian,
Kaiqing Zhang,
Russ Tedrake,
Suvrit Sra
Abstract:
We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.
Submitted 13 March, 2024; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Computing Brascamp-Lieb Constants through the lens of Thompson Geometry
Authors:
Melanie Weber,
Suvrit Sra
Abstract:
This paper studies algorithms for efficiently computing Brascamp-Lieb constants, a task that has recently received much interest. In particular, we reduce the computation to a nonlinear matrix-valued iteration, whose convergence we analyze through the lens of fixed-point methods under the well-known Thompson metric. This approach permits us to obtain (weakly) polynomial time guarantees, and it offers an efficient and transparent alternative to previous state-of-the-art approaches based on Riemannian optimization and geodesic convexity.
Submitted 14 April, 2024; v1 submitted 9 August, 2022;
originally announced August 2022.
-
CCCP is Frank-Wolfe in disguise
Authors:
Alp Yurtsever,
Suvrit Sra
Abstract:
This paper uncovers a simple but rather surprising connection: it shows that the well-known convex-concave procedure (CCCP) and its generalization to constrained problems are both special cases of the Frank-Wolfe (FW) method. This connection not only provides insight of deep (in our opinion) pedagogical value, but also transfers the recently discovered convergence theory of nonconvex Frank-Wolfe methods immediately to CCCP, closing a long-standing gap in its non-asymptotic convergence theory. We hope the viewpoint uncovered by this paper spurs the transfer of other advances made for FW to both CCCP and its generalizations.
Submitted 23 June, 2022;
originally announced June 2022.
-
On a class of geodesically convex optimization problems solved via Euclidean MM methods
Authors:
Melanie Weber,
Suvrit Sra
Abstract:
We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality along with guarantees on iteration complexity. On the other hand, the split structure permits us to develop Euclidean Majorization-Minorization algorithms that help us bypass the need to compute expensive Riemannian operations such as exponential maps and parallel transport. We illustrate our results by specializing them to a few concrete optimization problems that have been previously studied in the machine learning literature. Ultimately, we hope our work helps motivate the broader search for mixed Euclidean-Riemannian optimization algorithms.
Submitted 20 October, 2022; v1 submitted 22 June, 2022;
originally announced June 2022.
-
Theory and Algorithms for Diffusion Processes on Riemannian Manifolds
Authors:
Xiang Cheng,
Jingzhao Zhang,
Suvrit Sra
Abstract:
We study geometric stochastic differential equations (SDEs) and their approximations on Riemannian manifolds. In particular, we introduce a simple new construction of geometric SDEs on manifolds with bounded curvature, using which we provide the first (to our knowledge) non-asymptotic bound on the error of the geometric Euler-Maruyama discretization. We then bound the distance between the exact SDE and a discrete geometric random walk, where the noise can be non-Gaussian; this analysis is useful for using geometric SDEs to model naturally occurring discrete non-Gaussian stochastic processes. Our results provide convenient tools for studying MCMC algorithms that adopt non-standard noise distributions.
Submitted 20 November, 2023; v1 submitted 28 April, 2022;
originally announced April 2022.
-
Understanding the unstable convergence of gradient descent
Authors:
Kwangjun Ahn,
Jingzhao Zhang,
Suvrit Sra
Abstract:
Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications step sizes often do not fulfill this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from first principles, and discuss key causes behind it. We also identify its main characteristics, and how they interrelate based on both theory and experiments, offering a principled view toward understanding the phenomenon.
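The classical $2/L$ threshold mentioned in this abstract can be seen on a one-dimensional quadratic. The sketch below is only an illustration of that textbook condition, not of the paper's analysis: the function `gradient_descent`, the test function $f(x) = \frac{L}{2}x^2$, and the constants are all assumptions chosen here. On this quadratic each step multiplies the iterate by $(1 - \eta L)$, so the iterates shrink when $\eta < 2/L$ and grow when $\eta > 2/L$.

```python
def gradient_descent(step, L=4.0, x0=1.0, iters=50):
    """Run gradient descent on f(x) = (L/2) * x**2, which is L-smooth.

    The gradient is L * x, so each update is x <- (1 - step * L) * x.
    """
    x = x0
    for _ in range(iters):
        x = x - step * L * x
    return x

L = 4.0
# Step just below 2/L: contraction factor |1 - 1.9| = 0.9, iterates decay.
below_threshold = abs(gradient_descent(step=1.9 / L, L=L))
# Step just above 2/L: factor |1 - 2.1| = 1.1, iterates blow up.
above_threshold = abs(gradient_descent(step=2.1 / L, L=L))
```

A quadratic diverges as soon as the step exceeds $2/L$; the unstable-but-convergent behaviour the abstract describes arises for non-quadratic machine learning objectives, which this toy example does not capture.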
Submitted 9 June, 2022; v1 submitted 3 April, 2022;
originally announced April 2022.
-
Sign and Basis Invariant Networks for Spectral Graph Representation Learning
Authors:
Derek Lim,
Joshua Robinson,
Lingxiao Zhao,
Tess Smidt,
Suvrit Sra,
Haggai Maron,
Stefanie Jegelka
Abstract:
We introduce SignNet and BasisNet -- new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if $v$ is an eigenvector then so is $-v$; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors. We prove that under certain conditions our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the desired invariances. When used with Laplacian eigenvectors, our networks are provably more expressive than existing spectral methods on graphs; for instance, they subsume all spectral graph convolutions, certain spectral graph invariants, and previously proposed graph positional encodings as special cases. Experiments show that our networks significantly outperform existing baselines on molecular graph regression, learning expressive graph representations, and learning neural fields on triangle meshes. Our code is available at https://github.com/cptq/SignNet-BasisNet .
Submitted 30 September, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Sion's Minimax Theorem in Geodesic Metric Spaces and a Riemannian Extragradient Algorithm
Authors:
Peiyuan Zhang,
Jingzhao Zhang,
Suvrit Sra
Abstract:
Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. This paper takes a step towards understanding a broad class of nonconvex-nonconcave minimax problems that do remain tractable. Specifically, it studies minimax problems over geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and broadly accessible as it relies on the finite intersection property alone. The second main result is a specialization to geodesically complete Riemannian manifolds: here, we devise and analyze the complexity of first-order methods for smooth minimax problems.
Submitted 28 May, 2023; v1 submitted 13 February, 2022;
originally announced February 2022.
-
Time varying regression with hidden linear dynamics
Authors:
Ali Jadbabaie,
Horia Mania,
Devavrat Shah,
Suvrit Sra
Abstract:
We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.
Submitted 29 December, 2021;
originally announced December 2021.
-
Max-Margin Contrastive Learning
Authors:
Anshul Shah,
Suvrit Sra,
Rama Chellappa,
Anoop Cherian
Abstract:
Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learning (MMCL). Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem, and contrastiveness is enforced by maximizing the decision margin. As SVM optimization can be computationally demanding, especially in an end-to-end setting, we present simplifications that alleviate the computational burden. We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning over state-of-the-art, while having better empirical convergence properties.
Submitted 21 December, 2021;
originally announced December 2021.
-
Positive definite functions of noncommuting contractions, Hua-Bellman matrices, and a new distance metric
Authors:
Suvrit Sra
Abstract:
We study positive definite functions on noncommuting strict contractions. In particular, we study functions that induce positive definite Hua-Bellman matrices (i.e., matrices of the form $[\det(I-A_i^*A_j)^{-α}]_{ij}$ where $A_i$ and $A_j$ are strict contractions and $α\in\mathbb{C}$). We start by revisiting a 1959 work of Bellman (R. Bellman, "Representation theorems and inequalities for Hermitian matrices"; Duke Mathematical J., 26(3), 1959) that studies Hua-Bellman matrices and claims a strengthening of Hua's representation theoretic results on their positive definiteness (L.-K. Hua, "Inequalities involving determinants"; Acta Mathematica Sinica, 5(1955), pp. 463-470). We uncover a critical error in Bellman's proof that has surprisingly escaped notice to date. We "fix" this error and provide conditions under which $\det(I-A^*B)^{-α}$ is a positive definite function; our conditions correct Bellman's claim and subsume both Bellman's and Hua's prior results. Subsequently, we build on our result and introduce a new hyperbolic-like geometry on noncommuting contractions, and remark on its potential applications.
Submitted 30 November, 2021;
originally announced December 2021.
-
Introducing Discrepancy Values of Matrices with Application to Bounding Norms of Commutators
Authors:
Pourya Habib Zadeh,
Suvrit Sra
Abstract:
We introduce discrepancy values, quantities inspired by the notion of the spectral spread of Hermitian matrices. We define them as the discrepancy between two consecutive Ky-Fan-like seminorms. As a result, discrepancy values share many properties with singular values and eigenvalues, yet are sufficiently different to merit their own study. We describe key properties of discrepancy values, and establish several tools such as representation theorems, majorization inequalities, convex formulations, etc., for working with them. As an important application, we illustrate the role of discrepancy values in deriving tight bounds on the norms of commutators.
Submitted 14 June, 2022; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Understanding Riemannian Acceleration via a Proximal Extragradient Framework
Authors:
Jikai Jin,
Suvrit Sra
Abstract:
We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient (A-HPE), a powerful framework for obtaining Euclidean accelerated methods (Monteiro and Svaiter, 2013). Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.
Submitted 9 February, 2022; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond
Authors:
Chulhee Yun,
Shashank Rajput,
Suvrit Sra
Abstract:
In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
Submitted 23 March, 2022; v1 submitted 19 October, 2021;
originally announced October 2021.
-
Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective
Authors:
Jingzhao Zhang,
Haochuan Li,
Suvrit Sra,
Ali Jadbabaie
Abstract:
This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
Submitted 17 June, 2022; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Three Operator Splitting with Subgradients, Stochastic Gradients, and Adaptive Learning Rates
Authors:
Alp Yurtsever,
Alex Gu,
Suvrit Sra
Abstract:
Three Operator Splitting (TOS) (Davis & Yin, 2017) can minimize the sum of multiple convex functions effectively when an efficient gradient oracle or proximal operator is available for each term. This requirement often fails in machine learning applications: (i) instead of full gradients only stochastic gradients may be available; and (ii) instead of proximal operators, using subgradients to handle complex penalty functions may be more efficient and realistic. Motivated by these concerns, we analyze three potentially valuable extensions of TOS. The first two permit using subgradients and stochastic gradients, and are shown to ensure a $\mathcal{O}(1/\sqrt{t})$ convergence rate. The third extension AdapTOS endows TOS with adaptive step-sizes. For the important setting of optimizing a convex loss over the intersection of convex sets AdapTOS attains universal convergence rates, i.e., the rate adapts to the unknown smoothness degree of the objective. We compare our proposed methods with competing methods on various applications.
Submitted 18 February, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Can contrastive learning avoid shortcut solutions?
Authors:
Joshua Robinson,
Li Sun,
Ke Yu,
Kayhan Batmanghelich,
Stefanie Jegelka,
Suvrit Sra
Abstract:
The generalization of representations learned via contrastive learning depends crucially on what features of the data are extracted. However, we observe that the contrastive loss does not always sufficiently guide which features are extracted, a behavior that can negatively impact the performance on downstream tasks via "shortcuts", i.e., by inadvertently suppressing important predictive features. We find that feature extraction is influenced by the difficulty of the so-called instance discrimination task (i.e., the task of discriminating pairs of similar points from pairs of dissimilar ones). Although harder pairs improve the representation of some features, the improvement comes at the cost of suppressing previously well represented features. In response, we propose implicit feature modification (IFM), a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features. Empirically, we observe that IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks. The code is available at: https://github.com/joshr17/IFM.
Submitted 19 December, 2021; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities that can be summarized as: "single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD." We present theorems that support our conjecture by proving several special cases.
Submitted 11 March, 2021;
originally announced March 2021.
-
Three Operator Splitting with a Nonconvex Loss Function
Authors:
Alp Yurtsever,
Varun Mangalick,
Suvrit Sra
Abstract:
We consider the problem of minimizing the sum of three functions, one of which is nonconvex but differentiable, and the other two are convex but possibly nondifferentiable. We investigate the Three Operator Splitting method (TOS) of Davis & Yin (2017) with an aim to extend its theoretical guarantees for this nonconvex problem template. In particular, we prove convergence of TOS with nonasymptotic bounds on its nonstationarity and infeasibility errors. In contrast with the existing work on nonconvex TOS, our guarantees do not require additional smoothness assumptions on the terms comprising the objective; hence they cover instances of particular interest where the nondifferentiable terms are indicator functions. We also extend our results to a stochastic setting where we have access only to an unbiased estimator of the gradient. Finally, we illustrate the effectiveness of the proposed method through numerical experiments on quadratic assignment problems.
Submitted 13 June, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Provably Efficient Algorithms for Multi-Objective Competitive RL
Authors:
Tiancheng Yu,
Yi Tian,
Jingzhao Zhang,
Suvrit Sra
Abstract:
We study multi-objective reinforcement learning (RL) where an agent's reward is represented as a vector. In settings where an agent competes against opponents, its performance is measured by the distance of its average return vector to a target set. We develop statistically and computationally efficient algorithms to approach the associated target set. Our results extend Blackwell's approachability theorem (Blackwell, 1956) to tabular RL, where strategic exploration becomes essential. The algorithms presented are adaptive; their guarantees hold even without Blackwell's approachability condition. If the opponents use fixed policies, we give an improved rate of approaching the target set while also tackling the more ambitious goal of simultaneously minimizing a scalar cost function. We discuss our analysis for this special case by relating our results to previous works on constrained RL. To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games and our theoretical guarantees are near-optimal.
Submitted 5 February, 2021;
originally announced February 2021.
-
Why do classifier accuracies show linear trends under distribution shift?
Authors:
Horia Mania,
Suvrit Sra
Abstract:
Recent studies of generalization in deep learning have observed a puzzling trend: accuracies of models on one data distribution are approximately linear functions of the accuracies on another distribution. We explain this trend under an intuitive assumption on model similarity, which was verified empirically in prior work. More precisely, we assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone. Then, we show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large. This work emphasizes the value of understanding model similarity, which can have an impact on the generalization and robustness of classification models.
Submitted 22 February, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Online Learning in Unknown Markov Games
Authors:
Yi Tian,
Yuanhao Wang,
Tiancheng Yu,
Suvrit Sra
Abstract:
We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the minimax value of the game, and present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., Xie et al., 2020) by an exponential factor in the number of opponents.
Submitted 6 February, 2021; v1 submitted 28 October, 2020;
originally announced October 2020.
-
Coping with Label Shift via Distributionally Robust Optimisation
Authors:
Jingzhao Zhang,
Aditya Menon,
Andreas Veit,
Srinadh Bhojanapalli,
Sanjiv Kumar,
Suvrit Sra
Abstract:
The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.
Submitted 17 August, 2021; v1 submitted 23 October, 2020;
originally announced October 2020.
-
Contrastive Learning with Hard Negative Samples
Authors:
Joshua Robinson,
Ching-Yao Chuang,
Suvrit Sra,
Stefanie Jegelka
Abstract:
How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.
Submitted 24 January, 2021; v1 submitted 9 October, 2020;
originally announced October 2020.
-
Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes
Authors:
Yi Tian,
Jian Qian,
Suvrit Sra
Abstract:
We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.
Submitted 23 June, 2020;
originally announced June 2020.
-
SGD with shuffling: optimal rates without component convexity and large epoch requirements
Authors:
Kwangjun Ahn,
Chulhee Yun,
Suvrit Sra
Abstract:
We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Second, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior work: the large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.
Submitted 21 June, 2020; v1 submitted 12 June, 2020;
originally announced June 2020.
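To make the two sampling schemes named in the abstract above concrete, here is a minimal sketch of how the per-epoch index orders differ. The function name `epoch_orders`, its parameters, and the `with_replacement` baseline are illustrative assumptions, not taken from the paper; the sketch only generates visiting orders and does not implement the analyzed step-size choices.

```python
import random


def epoch_orders(n, num_epochs, scheme, seed=0):
    """Return, for each epoch, the order in which the n component indices are visited.

    Hypothetical helper for illustration only.
    scheme is one of "single_shuffle", "random_shuffle", "with_replacement".
    """
    rng = random.Random(seed)

    if scheme == "single_shuffle":
        # Shuffle once, then reuse the same permutation in every epoch.
        perm = list(range(n))
        rng.shuffle(perm)
        return [list(perm) for _ in range(num_epochs)]

    if scheme == "random_shuffle":
        # Draw a fresh permutation at the beginning of each epoch.
        orders = []
        for _ in range(num_epochs):
            perm = list(range(n))
            rng.shuffle(perm)
            orders.append(perm)
        return orders

    if scheme == "with_replacement":
        # Baseline: n independent uniform draws per epoch (indices may repeat).
        return [[rng.randrange(n) for _ in range(n)] for _ in range(num_epochs)]

    raise ValueError(f"unknown scheme: {scheme}")
```

In both without-replacement schemes every component is visited exactly once per epoch; the schemes differ only in whether the permutation is redrawn between epochs.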
-
Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity
Authors:
Jingzhao Zhang,
Hongzhou Lin,
Subhro Das,
Suvrit Sra,
Ali Jadbabaie
Abstract:
We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.
Submitted 17 June, 2022; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Understanding Nesterov's Acceleration via Proximal Point Method
Authors:
Kwangjun Ahn,
Suvrit Sra
Abstract:
The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use PPM to provide conceptually simple derivations along with convergence analyses of different versions of Nesterov's accelerated gradient method (AGM). The key observation is that AGM is a simple approximation of PPM, which results in an elementary derivation of the update equations and stepsizes of AGM. This view also leads to a transparent and conceptually simple analysis of AGM's convergence by using the analysis of PPM. The derivations also naturally extend to the strongly convex case. Ultimately, the results presented in this paper are of both didactic and conceptual value; they unify and explain existing variants of AGM while motivating other accelerated methods for practically relevant settings.
Submitted 2 June, 2022; v1 submitted 17 May, 2020;
originally announced May 2020.
-
On Tight Convergence Rates of Without-replacement SGD
Authors:
Kwangjun Ahn,
Suvrit Sra
Abstract:
For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD. Denoting by $n$ the number of components in the cost and $K$ the number of epochs of the algorithm, several recent works have shown convergence rates of without-replacement SGD that have better dependency on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for SGD. However, there are two main limitations shared among those works: the rates have extra poly-logarithmic factors on $nK$, and denoting by $κ$ the condition number of the problem, the rates hold after $κ^c\log(nK)$ epochs for some $c>0$. In this work, we overcome these limitations by analyzing step sizes that vary across epochs.
Submitted 18 April, 2020;
originally announced April 2020.
-
Strength from Weakness: Fast Learning Using Weak Supervision
Authors:
Joshua Robinson,
Stefanie Jegelka,
Suvrit Sra
Abstract:
We study generalization properties of weakly supervised learning, that is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(1/n)$, where $n$ denotes the number of strongly labeled data points. This acceleration can happen even if by itself the strongly labeled data admits only the slower $\mathcal{O}(1/\sqrt{n})$ rate. The actual acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.
Submitted 19 February, 2020;
originally announced February 2020.
-
Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions
Authors:
Jingzhao Zhang,
Hongzhou Lin,
Stefanie Jegelka,
Ali Jadbabaie,
Suvrit Sra
Abstract:
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an $ε$-stationary point with first-order methods is impossible in finite time. We then introduce the notion of $(δ, ε)$-stationarity, which allows for an $ε$-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance $δ$ to the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a $(δ, ε)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $δ$. Empirically, our methods perform well for training ReLU neural networks.
Submitted 29 June, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
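The $(δ, ε)$-stationarity notion described in the abstract above can be written out as follows. This is a sketch of the standard Goldstein-style formulation; the symbols $\partial f$ (generalized gradient), $B$ (closed unit ball), and $\mathrm{conv}$ (convex hull) are notation assumed here and do not appear in the listing itself.

$$
\partial f(x + δB) := \mathrm{conv}\Big( \bigcup_{y \,:\, \|y - x\| \le δ} \partial f(y) \Big),
\qquad
x \text{ is } (δ, ε)\text{-stationary if } \min_{g \in \partial f(x + δB)} \|g\| \le ε .
$$

Taking $δ = 0$ recovers the usual $ε$-stationarity condition $\min_{g \in \partial f(x)} \|g\| \le ε$.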
-
From Nesterov's Estimate Sequence to Riemannian Acceleration
Authors:
Kwangjun Ahn,
Suvrit Sra
Abstract:
We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our result we revisit Nesterov's estimate sequence technique and develop an alternative analysis for it that may also be of independent interest. Then, we extend this analysis to the Riemannian setting, localizing the key difficulty due to non-Euclidean structure into a certain "metric distortion." We control this distortion by developing a novel geometric inequality, which permits us to propose and analyze a Riemannian counterpart to Nesterov's accelerated gradient method.
Submitted 23 January, 2020;
originally announced January 2020.
-
Why are Adaptive Methods Good for Attention Models?
Authors:
Jingzhao Zhang,
Sai Praneeth Karimireddy,
Andreas Veit,
Seungyeon Kim,
Sashank J Reddi,
Sanjiv Kumar,
Suvrit Sra
Abstract:
While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an \emph{adaptive} coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.
Submitted 23 October, 2020; v1 submitted 6 December, 2019;
originally announced December 2019.
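The two basic forms of gradient clipping that the abstract above contrasts with plain SGD can be sketched as below. This is a minimal illustration with a fixed threshold, not the adaptive ACClip algorithm from the paper; the function names and the parameters `tau` (clipping threshold) and `lr` (step size) are hypothetical choices made for this example.

```python
import numpy as np


def clip_coordinatewise(grad, tau):
    """Clip every coordinate of grad to the interval [-tau, tau]."""
    return np.clip(grad, -tau, tau)


def clip_global_norm(grad, tau):
    """Rescale grad so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(grad)
    if norm <= tau:
        return grad.copy()
    return grad * (tau / norm)


def clipped_sgd_step(x, grad, lr, tau):
    """One SGD step using a coordinate-wise clipped gradient.

    tau is a fixed threshold here (an assumption of this sketch);
    an adaptive method would instead update it from gradient statistics.
    """
    return x - lr * clip_coordinatewise(grad, tau)
```

With a heavy-tailed gradient sample such as `np.array([0.5, -10.0, 3.0])` and `tau = 1.0`, coordinate-wise clipping leaves the small coordinate untouched and caps the two outliers, whereas global-norm clipping shrinks all coordinates by the same factor.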
-
Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
Authors:
Chi Jin,
Tiancheng Jin,
Haipeng Luo,
Suvrit Sra,
Tiancheng Yu
Abstract:
We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.
Submitted 2 November, 2020; v1 submitted 3 December, 2019;
originally announced December 2019.
-
Metrics Induced by Jensen-Shannon and Related Divergences on Positive Definite Matrices
Authors:
Suvrit Sra
Abstract:
We study metric properties of symmetric divergences on Hermitian positive definite matrices. In particular, we prove that the square root of these divergences is a distance metric. As a corollary we obtain a proof of the metric property for Quantum Jensen-Shannon-(Tsallis) divergences (parameterized by $α\in [0,2]$), which in turn (for $α=1$) yields a proof of the metric property of the Quantum Jensen-Shannon divergence that was conjectured by Lamberti \emph{et al.} a decade ago (\emph{Metric character of the quantum Jensen-Shannon divergence}, Phy.\ Rev.\ A, \textbf{79}, (2008).) A somewhat more intricate argument also establishes metric properties of Jensen-Rényi divergences (for $α\in (0,1)$), and outlines a technique that may be of independent interest.
Submitted 15 December, 2019; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Projection-free nonconvex stochastic optimization on Riemannian manifolds
Authors:
Melanie Weber,
Suvrit Sra
Abstract:
We study stochastic projection-free methods for constrained optimization of smooth functions on Riemannian manifolds, i.e., with additional constraints beyond the parameter domain being a manifold. Specifically, we introduce stochastic Riemannian Frank-Wolfe methods for nonconvex and geodesically convex problems. We present algorithms for both purely stochastic optimization and finite-sum problems. For the latter, we develop variance-reduced methods, including a Riemannian adaptation of the recently proposed Spider technique. For all settings, we recover convergence rates that are comparable to the best-known rates for their Euclidean counterparts. Finally, we discuss applications to two classic tasks: the computation of the Karcher mean of positive definite matrices and Wasserstein barycenters for multivariate normal distributions. For both tasks, stochastic Frank-Wolfe methods yield state-of-the-art empirical performance.
Submitted 3 April, 2021; v1 submitted 9 October, 2019;
originally announced October 2019.
-
Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation
Authors:
Tiancheng Yu,
Suvrit Sra
Abstract:
A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of \emph{adversarial} bandit problems. ARL achieves $O(\sqrt{SATH^3})$ regret, which is optimal with respect to $S$, $A$, and $T$, and its dependence on $H$ is the best (even for the usual stationary MDP) among existing model-free methods.
Submitted 21 August, 2019; v1 submitted 22 July, 2019;
originally announced July 2019.
-
Are deep ResNets provably better than linear predictors?
Authors:
Chulhee Yun,
Suvrit Sra,
Ali Jadbabaie
Abstract:
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representations obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows that, under simple geometric conditions, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.
Submitted 29 October, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
Near Optimal Stratified Sampling
Authors:
Tiancheng Yu,
Xiyu Zhai,
Suvrit Sra
Abstract:
The performance of a machine learning system is usually evaluated by using i.i.d.\ observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show that the proposed algorithms are rate-optimal up to logarithmic factors. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.
Submitted 26 July, 2019; v1 submitted 26 June, 2019;
originally announced June 2019.
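To make the role of per-stratum variance in the abstract above concrete, the sketch below shows classical stratified estimation with Neyman allocation, where the labeling budget is split in proportion to stratum weight times standard deviation. It assumes the standard deviations are known, which is exactly the assumption the abstract calls unrealistic, so this is the textbook baseline rather than either of the paper's algorithms. The function names and arguments are hypothetical.

```python
import numpy as np


def neyman_allocation(weights, stds, budget):
    """Split a labeling budget across strata in proportion to weight * std.

    weights: population share of each stratum (should sum to 1).
    stds:    per-stratum standard deviations, assumed known in this sketch.
    budget:  total number of labels that can be acquired.
    """
    weights = np.asarray(weights, dtype=float)
    stds = np.asarray(stds, dtype=float)
    scores = weights * stds
    if scores.sum() == 0:
        # All strata are noise-free: fall back to proportional allocation.
        scores = weights
    raw = budget * scores / scores.sum()
    counts = np.floor(raw).astype(int)
    # Give the labels lost to rounding to the largest fractional parts.
    remainder = budget - counts.sum()
    order = np.argsort(-(raw - counts))
    counts[order[:remainder]] += 1
    return counts


def stratified_mean(weights, samples_per_stratum):
    """Weighted average of per-stratum sample means.

    Every stratum must contain at least one labeled sample.
    """
    return sum(w * np.mean(s) for w, s in zip(weights, samples_per_stratum))
```

For two equally sized strata with standard deviations 1 and 3 and a budget of 8 labels, the allocation assigns 2 labels to the first stratum and 6 to the second, concentrating labels where the metric is noisier.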