First-Order Methods for Linearly Constrained Bilevel Optimization

Authors: Guy Kornowski, Swati Padmanabhan, Kai Wang, Zhe Zhang, Suvrit Sra

Abstract: Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality constraints, we attain $ε$-stationarity in $\widetilde{O}(ε^{-2})$ gradient oracle calls, which is nearly optimal. For linear inequality constraints, we attain $(δ,ε)$-Goldstein stationarity in $\widetilde{O}(d{δ^{-1} ε^{-3}})$ gradient oracle calls, where $d$ is the upper-level dimension. Finally, we obtain for the linear inequality setting dimension-free rates of $\widetilde{O}({δ^{-1} ε^{-4}})$ oracle complexity under the additional assumption of oracle access to the optimal dual variable. Along the way, we develop new nonsmooth nonconvex optimization methods with inexact oracles. We verify these guarantees with preliminary numerical experiments.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2405.15816 [pdf, other]

Riemannian Bilevel Optimization

Authors: Sanchayan Dutta, Xiang Cheng, Suvrit Sra

Abstract: We develop new algorithms for Riemannian bilevel optimization. We focus in particular on batch and stochastic gradient-based methods, with the explicit goal of avoiding second-order information such as Riemannian hyper-gradients. We propose and analyze $\mathrm{RF^2SA}$, a method that leverages first-order gradient information to navigate the complex geometry of Riemannian manifolds efficiently. Notably, $\mathrm{RF^2SA}$ is a single-loop algorithm, and thus easier to implement and use. Under various setups, including stochastic optimization, we provide explicit convergence rates for reaching $ε$-stationary points. We also address the challenge of optimizing over Riemannian manifolds with constraints by adjusting the multiplier in the Lagrangian, ensuring convergence to the desired solution without requiring access to second-order derivatives.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2402.10357 [pdf, other]

Efficient Sampling on Riemannian Manifolds via Langevin MCMC

Authors: Xiang Cheng, Jingzhao Zhang, Suvrit Sra

Abstract: We study the task of efficiently sampling from a Gibbs distribution $d π^* = e^{-h} d {vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice. The key to our analysis of Langevin MCMC is a bound on the discretization error of the geometric Euler-Maruyama scheme, assuming $\nabla h$ is Lipschitz and $M$ has bounded sectional curvature. Our error bound matches the error of Euclidean Euler-Maruyama in terms of its stepsize dependence. Combined with a contraction guarantee for the geometric Langevin Diffusion under Kendall-Cranston coupling, we prove that the Langevin MCMC iterates lie within $ε$-Wasserstein distance of $π^*$ after $\tilde{O}(ε^{-2})$ steps, which matches the iteration complexity for Euclidean Langevin MCMC. Our results apply in general settings where $h$ can be nonconvex and $M$ can have negative Ricci curvature. Under additional assumptions that the Riemannian curvature tensor has bounded derivatives, and that $π^*$ satisfies a $CD(\cdot,\infty)$ condition, we analyze the stochastic gradient version of Langevin MCMC, and bound its iteration complexity by $\tilde{O}(ε^{-2})$ as well.

Submitted 15 February, 2024; originally announced February 2024.

Comments: This is an old paper from NeurIPS 2022. arXiv admin note: text overlap with arXiv:2204.13665

arXiv:2312.06528 [pdf, other]

Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra

Abstract: Many neural network architectures are known to be Turing Complete, and can thus, in principle, implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. This paper provides theoretical and empirical evidence that (non-linear) Transformers naturally learn to implement gradient descent in function space, which in turn enables them to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Additionally, we show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.

Submitted 3 June, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2310.01082 [pdf, other]

Linear attention is (maybe) all you need (to understand transformer optimization)

Authors: Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.

Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: Published at ICLR 2024

arXiv:2307.04456 [pdf, other]

Invex Programs: First Order Algorithms and Their Convergence

Authors: Adarsh Barik, Suvrit Sra, Jean Honorio

Abstract: Invex programs are a special kind of non-convex problems which attain global minima at every stationary point. While classical first-order gradient descent methods can solve them, they converge very slowly. In this paper, we propose new first-order algorithms to solve the general class of invex problems. We identify sufficient conditions for convergence of our algorithms and provide rates of convergence. Furthermore, we go beyond unconstrained problems and provide a novel projected gradient method for constrained invex programs with convergence rate guarantees. We compare and contrast our results with existing first-order algorithms for a variety of unconstrained and constrained invex problems. To the best of our knowledge, our proposed algorithm is the first algorithm to solve constrained invex programs.

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2306.00297 [pdf, other]

Transformers learn to implement preconditioned gradient descent for in-context learning

Authors: Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, Suvrit Sra

Abstract: Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instances? To our knowledge, we make the first theoretical progress on this question via an analysis of the loss landscape for linear transformers trained over random instances of linear regression. For a single attention layer, we prove the global minimum of the training objective implements a single iteration of preconditioned gradient descent. Notably, the preconditioning matrix not only adapts to the input distribution but also to the variance induced by data inadequacy. For a transformer with $L$ attention layers, we prove certain critical points of the training objective implement $L$ iterations of preconditioned gradient descent. Our results call for future theoretical studies on learning algorithms by training transformers.

Submitted 9 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

Comments: Improved presentation and added new results for the nonlinear activation case; 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

Journal ref: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2305.15659 [pdf, other]

How to escape sharp minima with random perturbations

Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra

Abstract: Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.

Submitted 25 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: Accepted at ICML 2024

arXiv:2305.15287 [pdf, other]

The Crucial Role of Normalization in Sharpness-Aware Minimization

Authors: Yan Dai, Kwangjun Ahn, Suvrit Sra

Abstract: Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.

Submitted 23 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 30 pages, Published in 37th Neural Information Processing Systems (NeurIPS 2023)

arXiv:2302.12444 [pdf, other]

On the Training Instability of Shuffling SGD with Batch Normalization

Authors: David X. Wu, Chulhee Yun, Suvrit Sra

Abstract: We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from gradient descent. Thereafter, for classification we characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions to show how SS leads to distorted optima in regression and divergence for classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR used with batch normalization is relevant in practice.

Submitted 14 August, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: ICML 2023 camera-ready version, added references; 75 pages

arXiv:2301.08342 [pdf, other]

The Hornich-Hlawka functional inequality for functions with positive differences

Authors: Constantin P. Niculescu, Suvrit Sra

Abstract: We analyze the role played by $n$-convexity for the fulfillment of a series of linear functional inequalities that extend the Hornich-Hlawka functional inequality, $f\left( x\right) +f\left( y\right) +f\left( z\right) +f\left( x+y+z\right) \geq f\left( x+y\right) +f\left( y+z\right)+f\left( z+x\right) +f(0),$ including extensions to the case of positive operators.

Submitted 19 January, 2023; originally announced January 2023.

Comments: 20 pages

MSC Class: Primary 26B25; Secondary 26B35; 26D15; 26A48; 26A51

arXiv:2212.14511 [pdf, other]

Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?

Authors: Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra

Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.

Submitted 13 March, 2024; v1 submitted 29 December, 2022; originally announced December 2022.

Comments: 37 pages; Updated structure and proofs

arXiv:2208.05013 [pdf, other]

Computing Brascamp-Lieb Constants through the lens of Thompson Geometry

Authors: Melanie Weber, Suvrit Sra

Abstract: This paper studies algorithms for efficiently computing Brascamp-Lieb constants, a task that has recently received much interest. In particular, we reduce the computation to a nonlinear matrix-valued iteration, whose convergence we analyze through the lens of fixed-point methods under the well-known Thompson metric. This approach permits us to obtain (weakly) polynomial time guarantees, and it offers an efficient and transparent alternative to previous state-of-the-art approaches based on Riemannian optimization and geodesic convexity.

Submitted 14 April, 2024; v1 submitted 9 August, 2022; originally announced August 2022.

Comments: Under Review

MSC Class: 46N10; 49Q99; 53Z50; 68W40

arXiv:2206.12014 [pdf, ps, other]

CCCP is Frank-Wolfe in disguise

Authors: Alp Yurtsever, Suvrit Sra

Abstract: This paper uncovers a simple but rather surprising connection: it shows that the well-known convex-concave procedure (CCCP) and its generalization to constrained problems are both special cases of the Frank-Wolfe (FW) method. This connection not only provides insight of deep (in our opinion) pedagogical value, but also transfers the recently discovered convergence theory of nonconvex Frank-Wolfe methods immediately to CCCP, closing a long-standing gap in its non-asymptotic convergence theory. We hope the viewpoint uncovered by this paper spurs the transfer of other advances made for FW to both CCCP and its generalizations.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2206.11426 [pdf, other]

On a class of geodesically convex optimization problems solved via Euclidean MM methods

Authors: Melanie Weber, Suvrit Sra

Abstract: We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality along with guarantees on iteration complexity. On the other hand, the split structure permits us to develop Euclidean Majorization-Minorization algorithms that help us bypass the need to compute expensive Riemannian operations such as exponential maps and parallel transport. We illustrate our results by specializing them to a few concrete optimization problems that have been previously studied in the machine learning literature. Ultimately, we hope our work helps motivate the broader search for mixed Euclidean-Riemannian optimization algorithms.

Submitted 20 October, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: Under Review

arXiv:2204.13665 [pdf, other]

Theory and Algorithms for Diffusion Processes on Riemannian Manifolds

Authors: Xiang Cheng, Jingzhao Zhang, Suvrit Sra

Abstract: We study geometric stochastic differential equations (SDEs) and their approximations on Riemannian manifolds. In particular, we introduce a simple new construction of geometric SDEs, using which with bounded curvature. In particular, we provide the first (to our knowledge) non-asymptotic bound on the error of the geometric Euler-Maruyama discretization. We then bound the distance between the exact SDE and a discrete geometric random walk, where the noise can be non-Gaussian; this analysis is useful for using geometric SDEs to model naturally occurring discrete non-Gaussian stochastic processes. Our results provide convenient tools for studying MCMC algorithms that adopt non-standard noise distributions.

Submitted 20 November, 2023; v1 submitted 28 April, 2022; originally announced April 2022.

arXiv:2204.01050 [pdf, ps, other]

Understanding the unstable convergence of gradient descent

Authors: Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra

Abstract: Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications step sizes often do not fulfill this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from first principles, and discuss key causes behind it. We also identify its main characteristics, and how they interrelate based on both theory and experiments, offering a principled view toward understanding the phenomenon.

Submitted 9 June, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

Comments: Accepted to the 39th International Conference on Machine Learning (ICML 2022), Baltimore, Maryland, USA. Version 2 improves writing and presentation, adds discussion regarding concurrent works

arXiv:2202.13013 [pdf, other]

Sign and Basis Invariant Networks for Spectral Graph Representation Learning

Authors: Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, Stefanie Jegelka

Abstract: We introduce SignNet and BasisNet -- new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if $v$ is an eigenvector then so is $-v$; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors. We prove that under certain conditions our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the desired invariances. When used with Laplacian eigenvectors, our networks are provably more expressive than existing spectral methods on graphs; for instance, they subsume all spectral graph convolutions, certain spectral graph invariants, and previously proposed graph positional encodings as special cases. Experiments show that our networks significantly outperform existing baselines on molecular graph regression, learning expressive graph representations, and learning neural fields on triangle meshes. Our code is available at https://github.com/cptq/SignNet-BasisNet .

Submitted 30 September, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 42 pages

arXiv:2202.06950 [pdf, other]

Sion's Minimax Theorem in Geodesic Metric Spaces and a Riemannian Extragradient Algorithm

Authors: Peiyuan Zhang, Jingzhao Zhang, Suvrit Sra

Abstract: Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. This paper takes a step towards understanding a broad class of nonconvex-nonconcave minimax problems that do remain tractable. Specifically, it studies minimax problems over geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and broadly accessible as it relies on the finite intersection property alone. The second main result is a specialization to geodesically complete Riemannian manifolds: here, we devise and analyze the complexity of first-order methods for smooth minimax problems.

Submitted 28 May, 2023; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: 23 pages, 3 figures

arXiv:2112.14862 [pdf, ps, other]

Time varying regression with hidden linear dynamics

Authors: Ali Jadbabaie, Horia Mania, Devavrat Shah, Suvrit Sra

Abstract: We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 22 pages

arXiv:2112.11450 [pdf, other]

Max-Margin Contrastive Learning

Authors: Anshul Shah, Suvrit Sra, Rama Chellappa, Anoop Cherian

Abstract: Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learning (MMCL). Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem, and contrastiveness is enforced by maximizing the decision margin. As SVM optimization can be computationally demanding, especially in an end-to-end setting, we present simplifications that alleviate the computational burden. We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning over state-of-the-art, while having better empirical convergence properties.

Submitted 21 December, 2021; originally announced December 2021.

Comments: Accepted at AAAI 2022

arXiv:2112.00056 [pdf, ps, other]

Positive definite functions of noncommuting contractions, Hua-Bellman matrices, and a new distance metric

Authors: Suvrit Sra

Abstract: We study positive definite functions on noncommuting strict contractions. In particular, we study functions that induce positive definite Hua-Bellman matrices (i.e., matrices of the form $[\det(I-A_i^*A_j)^{-α}]_{ij}$ where $A_i$ and $A_j$ are strict contractions and $α\in\mathbb{C}$). We start by revisiting a 1959 work of Bellman (R. Bellman, Representation theorems and inequalities for Hermitian matrices; Duke Mathematical J., 26(3), 1959) that studies Hua-Bellman matrices and claims a strengthening of Hua's representation theoretic results on their positive definiteness (L.-K. Hua, Inequalities involving determinants; Acta Mathematica Sinica, 5 (1955), pp. 463-470). We uncover a critical error in Bellman's proof that has surprisingly escaped notice to date. We "fix" this error and provide conditions under which $\det(I-A^*B)^{-α}$ is a positive definite function; our conditions correct Bellman's claim and subsume both Bellman's and Hua's prior results. Subsequently, we build on our result and introduce a new hyperbolic-like geometry on noncommuting contractions, and remark on its potential applications.

Submitted 30 November, 2021; originally announced December 2021.

Comments: 11 pages

arXiv:2111.11855 [pdf, ps, other]

Introducing Discrepancy Values of Matrices with Application to Bounding Norms of Commutators

Authors: Pourya Habib Zadeh, Suvrit Sra

Abstract: We introduce discrepancy values, quantities inspired by the notion of the spectral spread of Hermitian matrices. We define them as the discrepancy between two consecutive Ky-Fan-like seminorms. As a result, discrepancy values share many properties with singular values and eigenvalues, yet are substantially different to merit their own study. We describe key properties of discrepancy values, and establish several tools such as representation theorems, majorization inequalities, convex formulations, etc., for working with them. As an important application, we illustrate the role of discrepancy values in deriving tight bounds on the norms of commutators.

Submitted 14 June, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

arXiv:2111.02763 [pdf, ps, other]

Understanding Riemannian Acceleration via a Proximal Extragradient Framework

Authors: Jikai Jin, Suvrit Sra

Abstract: We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient (A-HPE), a powerful framework for obtaining Euclidean accelerated methods (Monteiro & Svaiter, 2013). Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.

Submitted 9 February, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.10342 [pdf, other]

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra

Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.

Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: ICLR 2022 camera-ready (selected for an oral presentation); 76 pages, 3 figures

arXiv:2110.06256 [pdf, other]

Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie

Abstract: This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.

Submitted 17 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Journal ref: ICML 2022

arXiv:2110.03274 [pdf, other]

Three Operator Splitting with Subgradients, Stochastic Gradients, and Adaptive Learning Rates

Authors: Alp Yurtsever, Alex Gu, Suvrit Sra

Abstract: Three Operator Splitting (TOS) (Davis & Yin, 2017) can minimize the sum of multiple convex functions effectively when an efficient gradient oracle or proximal operator is available for each term. This requirement often fails in machine learning applications: (i) instead of full gradients only stochastic gradients may be available; and (ii) instead of proximal operators, using subgradients to handle complex penalty functions may be more efficient and realistic. Motivated by these concerns, we analyze three potentially valuable extensions of TOS. The first two permit using subgradients and stochastic gradients, and are shown to ensure a $\mathcal{O}(1/\sqrt{t})$ convergence rate. The third extension AdapTOS endows TOS with adaptive step-sizes. For the important setting of optimizing a convex loss over the intersection of convex sets AdapTOS attains universal convergence rates, i.e., the rate adapts to the unknown smoothness degree of the objective. We compare our proposed methods with competing methods on various applications.

Submitted 18 February, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Appears in the 35th Annual Conference on Neural Information Processing Systems (NeurIPS 2021)
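
As a rough illustration of the baseline that the abstract above extends, here is a minimal sketch of the standard Davis-Yin three operator splitting iteration, applied to the setting the abstract mentions (a convex loss over an intersection of convex sets). The problem instance and all function names (`clip_box`, `project_sum_to_one`, `three_operator_splitting`) are assumptions chosen for illustration and do not come from the paper; the sketch uses exact gradients and projections, so it does not implement the subgradient, stochastic, or adaptive (AdapTOS) variants.

```python
def clip_box(v, lo=0.0, hi=1.0):
    """Projection onto the box [lo, hi]^n (the prox of its indicator function)."""
    return [min(max(t, lo), hi) for t in v]


def project_sum_to_one(v):
    """Projection onto the hyperplane {x : sum(x) = 1}."""
    shift = (sum(v) - 1.0) / len(v)
    return [t - shift for t in v]


def three_operator_splitting(b, gamma=1.0, iters=50):
    """Minimize f(x) = 0.5 * ||x - b||^2 over the box intersected with the hyperplane.

    f is 1-smooth, so gamma = 1.0 lies in the usual step-size range (0, 2/L).
    """
    z = [0.0] * len(b)
    for _ in range(iters):
        x_g = clip_box(z)                                # prox step on g (box indicator)
        grad = [xi - bi for xi, bi in zip(x_g, b)]       # gradient of f at x_g
        y = [2.0 * xi - zi - gamma * gi
             for xi, zi, gi in zip(x_g, z, grad)]
        x_h = project_sum_to_one(y)                      # prox step on h (hyperplane indicator)
        z = [zi + hi - gi for zi, hi, gi in zip(z, x_h, x_g)]
    return clip_box(z)


# Hypothetical instance: the closest feasible point to b = (2, -1).
solution = three_operator_splitting([2.0, -1.0])
```

On this toy instance the feasible set is the segment between (1, 0) and (0, 1), so the iterate can be checked by hand against the nearest point on that segment.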

arXiv:2106.11230 [pdf, other]

Can contrastive learning avoid shortcut solutions?

Authors: Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, Suvrit Sra

Abstract: The generalization of representations learned via contrastive learning depends crucially on what features of the data are extracted. However, we observe that the contrastive loss does not always sufficiently guide which features are extracted, a behavior that can negatively impact the performance on downstream tasks via "shortcuts", i.e., by inadvertently suppressing important predictive features. We find that feature extraction is influenced by the difficulty of the so-called instance discrimination task (i.e., the task of discriminating pairs of similar points from pairs of dissimilar ones). Although harder pairs improve the representation of some features, the improvement comes at the cost of suppressing previously well represented features. In response, we propose implicit feature modification (IFM), a method for altering positive and negative samples in order to guide contrastive models towards capturing a wider variety of predictive features. Empirically, we observe that IFM reduces feature suppression, and as a result improves performance on vision and medical imaging tasks. The code is available at: https://github.com/joshr17/IFM.

Submitted 19 December, 2021; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021

arXiv:2103.07079 [pdf, other]

Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restrict our attention to positive definite matrices with small enough condition numbers, which are more relevant to matrices that arise in the analysis of SGD. For such matrices, we conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities that can be summarized as: "single-shuffle SGD converges faster than random-reshuffle SGD, which is in turn faster than with-replacement SGD." We present theorems that support our conjecture by proving several special cases.

Submitted 11 March, 2021; originally announced March 2021.

Comments: 26 pages, 2 figures

arXiv:2103.04568 [pdf, other]

Three Operator Splitting with a Nonconvex Loss Function

Authors: Alp Yurtsever, Varun Mangalick, Suvrit Sra

Abstract: We consider the problem of minimizing the sum of three functions, one of which is nonconvex but differentiable, and the other two are convex but possibly nondifferentiable. We investigate the Three Operator Splitting method (TOS) of Davis & Yin (2017) with an aim to extend its theoretical guarantees for this nonconvex problem template. In particular, we prove convergence of TOS with nonasymptotic bounds on its nonstationarity and infeasibility errors. In contrast with the existing work on nonconvex TOS, our guarantees do not require additional smoothness assumptions on the terms comprising the objective; hence they cover instances of particular interest where the nondifferentiable terms are indicator functions. We also extend our results to a stochastic setting where we have access only to an unbiased estimator of the gradient. Finally, we illustrate the effectiveness of the proposed method through numerical experiments on quadratic assignment problems.

Submitted 13 June, 2021; v1 submitted 8 March, 2021; originally announced March 2021.

Comments: Appears in Proceedings of the 38th International Conference on Machine Learning (ICML 2021)

arXiv:2102.03192 [pdf, other]

Provably Efficient Algorithms for Multi-Objective Competitive RL

Authors: Tiancheng Yu, Yi Tian, Jingzhao Zhang, Suvrit Sra

Abstract: We study multi-objective reinforcement learning (RL) where an agent's reward is represented as a vector. In settings where an agent competes against opponents, its performance is measured by the distance of its average return vector to a target set. We develop statistically and computationally efficient algorithms to approach the associated target set. Our results extend Blackwell's approachability theorem (Blackwell, 1956) to tabular RL, where strategic exploration becomes essential. The algorithms presented are adaptive; their guarantees hold even without Blackwell's approachability condition. If the opponents use fixed policies, we give an improved rate of approaching the target set while also tackling the more ambitious goal of simultaneously minimizing a scalar cost function. We discuss our analysis for this special case by relating our results to previous works on constrained RL. To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games and our theoretical guarantees are near-optimal.

Submitted 5 February, 2021; originally announced February 2021.

arXiv:2012.15483 [pdf, other]

Why do classifier accuracies show linear trends under distribution shift?

Authors: Horia Mania, Suvrit Sra

Abstract: Recent studies of generalization in deep learning have observed a puzzling trend: accuracies of models on one data distribution are approximately linear functions of the accuracies on another distribution. We explain this trend under an intuitive assumption on model similarity, which was verified empirically in prior work. More precisely, we assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone. Then, we show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large. This work emphasizes the value of understanding model similarity, which can have an impact on the generalization and robustness of classification models.

Submitted 22 February, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: 18 pages, 13 figures

arXiv:2010.15020 [pdf, other]

Online Learning in Unknown Markov Games

Authors: Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra

Abstract: We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the minimax value of the game, and present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., Xie et al., 2020) by an exponential factor in the number of opponents.

Submitted 6 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 25 pages

arXiv:2010.12230 [pdf, other]

Coping with Label Shift via Distributionally Robust Optimisation

Authors: Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

Abstract: The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in multiple test environments. Can one instead learn a single classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.

Submitted 17 August, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

arXiv:2010.04592 [pdf, other]

Contrastive Learning with Hard Negative Samples

Authors: Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, Stefanie Jegelka

Abstract: How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

Submitted 24 January, 2021; v1 submitted 9 October, 2020; originally announced October 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:2006.13405 [pdf, other]

Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Authors: Yi Tian, Jian Qian, Suvrit Sra

Abstract: We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.

Submitted 23 June, 2020; originally announced June 2020.

Comments: 54 pages

arXiv:2006.06946 [pdf, other]

SGD with shuffling: optimal rates without component convexity and large epoch requirements

Authors: Kwangjun Ahn, Chulhee Yun, Suvrit Sra

Abstract: We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Secondly, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior work: the large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.

Submitted 21 June, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 53 pages; supersedes the preprint arXiv:2004.08657; v2 corrects an erroneous claim about SingleShuffle and newly adds Theorem 24 and Appendix F for SingleShuffle
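
The two shuffling schemes named in the abstract above differ only in when the permutation is drawn, which a short sketch can make concrete. The toy objective (an average of one-dimensional quadratics), the step size, and the function names `epoch_orders` and `shuffled_sgd` are assumptions made for illustration; this is not the paper's experimental setup, and it says nothing about the rates the paper proves.

```python
import random


def epoch_orders(n, num_epochs, scheme, seed=0):
    """Return the order in which the n components are visited in each epoch."""
    rng = random.Random(seed)
    fixed = list(range(n))
    rng.shuffle(fixed)                      # drawn once; only used by "single"
    orders = []
    for _ in range(num_epochs):
        if scheme == "single":              # SingleShuffle: reuse one permutation
            orders.append(list(fixed))
        elif scheme == "random":            # RandomShuffle: fresh permutation per epoch
            perm = list(range(n))
            rng.shuffle(perm)
            orders.append(perm)
        else:
            raise ValueError("scheme must be 'single' or 'random'")
    return orders


def shuffled_sgd(a, num_epochs, step, scheme, seed=0):
    """Without-replacement SGD on f(x) = (1/n) * sum_i 0.5 * (x - a[i])**2."""
    x = 0.0
    for order in epoch_orders(len(a), num_epochs, scheme, seed):
        for i in order:
            x -= step * (x - a[i])          # gradient of the i-th component
    return x


data = [1.0, 2.0, 3.0, 4.0]                 # minimizer of f is the mean, 2.5
x_single = shuffled_sgd(data, num_epochs=200, step=0.01, scheme="single")
x_random = shuffled_sgd(data, num_epochs=200, step=0.01, scheme="random")
```

With a small constant step size both variants end up near the mean of `data`; the residual offset depends on the step size and on the visiting order.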

arXiv:2006.04429 [pdf, other]

Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity

Authors: Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie

Abstract: We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.

Submitted 17 June, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

Journal ref: ICML 2022

arXiv:2005.08304 [pdf, ps, other]

Understanding Nesterov's Acceleration via Proximal Point Method

Authors: Kwangjun Ahn, Suvrit Sra

Abstract: The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use PPM to provide conceptually simple derivations along with convergence analyses of different versions of Nesterov's accelerated gradient method (AGM). The key observation is that AGM is a simple approximation of PPM, which results in an elementary derivation of the update equations and stepsizes of AGM. This view also leads to a transparent and conceptually simple analysis of AGM's convergence by using the analysis of PPM. The derivations also naturally extend to the strongly convex case. Ultimately, the results presented in this paper are of both didactic and conceptual value; they unify and explain existing variants of AGM while motivating other accelerated methods for practically relevant settings.

Submitted 2 June, 2022; v1 submitted 17 May, 2020; originally announced May 2020.

Comments: 14 pages; Presented at SIAM Symposium on Simplicity in Algorithms (SOSA22), January 10 - 11, 2022
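
To make the building block in the abstract above concrete, the following sketch runs the proximal point step $x_{k+1} = \arg\min_x \{ f(x) + \frac{1}{2λ}(x - x_k)^2 \}$ on a one-dimensional quadratic, next to the step obtained when $f$ is replaced by its linearization at $x_k$ (which is plain gradient descent). The quadratic $f(x) = \frac{a}{2}(x-c)^2$ and the names `prox_step`, `gradient_step`, and `run` are assumptions for illustration only; the sketch does not reproduce the paper's derivation of AGM.

```python
def prox_step(x, lam, a, c):
    """Exact proximal point step for f(x) = 0.5 * a * (x - c)**2.

    Solves a * (y - c) + (y - x) / lam = 0 for y.
    """
    return (x + lam * a * c) / (1.0 + lam * a)


def gradient_step(x, lam, a, c):
    """Proximal step with f replaced by its linearization at x (gradient descent)."""
    return x - lam * a * (x - c)


def run(step, x0, lam, a, c, iters):
    """Apply the given step function iters times starting from x0."""
    x = x0
    for _ in range(iters):
        x = step(x, lam, a, c)
    return x


# The exact proximal step contracts toward c for any lam > 0, even a large one.
x_ppm = run(prox_step, x0=0.0, lam=10.0, a=1.0, c=3.0, iters=20)
# The linearized step needs a small enough lam (here lam * a < 2) to converge.
x_gd = run(gradient_step, x0=0.0, lam=0.5, a=1.0, c=3.0, iters=60)
```

The contrast in admissible step sizes is the usual reason the exact proximal step is treated as the idealized method and gradient-type updates as its approximations.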

arXiv:2004.08657 [pdf, ps, other]

On Tight Convergence Rates of Without-replacement SGD

Authors: Kwangjun Ahn, Suvrit Sra

Abstract: For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD. Denoting by $n$ the number of components in the cost and $K$ the number of epochs of the algorithm, several recent works have shown convergence rates of without-replacement SGD that have better dependency on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for SGD. However, there are two main limitations shared among those works: the rates have extra poly-logarithmic factors on $nK$, and denoting by $κ$ the condition number of the problem, the rates hold after $κ^c\log(nK)$ epochs for some $c>0$. In this work, we overcome these limitations by analyzing step sizes that vary across epochs.

Submitted 18 April, 2020; originally announced April 2020.

Comments: 12 pages

arXiv:2002.08483 [pdf, other]

Strength from Weakness: Fast Learning Using Weak Supervision

Authors: Joshua Robinson, Stefanie Jegelka, Suvrit Sra

Abstract: We study generalization properties of weakly supervised learning. That is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(1/n)$, where $n$ denotes the number of strongly labeled data points. This acceleration can happen even if by itself the strongly labeled data admits only the slower $\mathcal{O}(1/\sqrt{n})$ rate. The actual acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.

Submitted 19 February, 2020; originally announced February 2020.

Comments: 21 pages, 8 figures

arXiv:2002.04130 [pdf, other]

Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

Authors: Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, Suvrit Sra

Abstract: We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We first show that finding an $ε$-stationary point with first-order methods is impossible in finite time. We then introduce the notion of $(δ, ε)$-stationarity, which allows for an $ε$-approximate gradient to be the convex combination of generalized gradients evaluated at points within distance $δ$ to the solution. We propose a series of randomized first-order methods and analyze their complexity of finding a $(δ, ε)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $δ$. Empirically, our methods perform well for training ReLU neural networks.

Submitted 29 June, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

arXiv:2001.08876 [pdf, other]

From Nesterov's Estimate Sequence to Riemannian Acceleration

Authors: Kwangjun Ahn, Suvrit Sra

Abstract: We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our result we revisit Nesterov's estimate sequence technique and develop an alternative analysis for it that may also be of independent interest. Then, we extend this analysis to the Riemannian setting, localizing the key difficulty due to non-Euclidean structure into a certain ``metric distortion.'' We control this distortion by developing a novel geometric inequality, which permits us to propose and analyze a Riemannian counterpart to Nesterov's accelerated gradient method.

Submitted 23 January, 2020; originally announced January 2020.

Comments: 30 pages

arXiv:1912.03194 [pdf, other]

Why are Adaptive Methods Good for Attention Models?

Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra

Abstract: While stochastic gradient descent (SGD) is still the \emph{de facto} algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise. Further, we demonstrate how gradient clipping plays a key role in addressing heavy-tailed gradient noise. Subsequently, we show how clipping can be applied in practice by developing an \emph{adaptive} coordinate-wise clipping algorithm (ACClip) and demonstrate its superior performance on BERT pretraining and finetuning tasks.

Submitted 23 October, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

arXiv:1912.01192 [pdf, ps, other]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.

Submitted 2 November, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: Fix a bug

MSC Class: I.2.6 ACM Class: I.2.6

arXiv:1911.02643 [pdf, ps, other]

Metrics Induced by Jensen-Shannon and Related Divergences on Positive Definite Matrices

Authors: Suvrit Sra

Abstract: We study metric properties of symmetric divergences on Hermitian positive definite matrices. In particular, we prove that the square root of these divergences is a distance metric. As a corollary we obtain a proof of the metric property for Quantum Jensen-Shannon-(Tsallis) divergences (parameterized by $α\in [0,2]$), which in turn (for $α=1$) yields a proof of the metric property of the Quantum Jensen-Shannon divergence that was conjectured by Lamberti \emph{et al.} a decade ago (\emph{Metric character of the quantum Jensen-Shannon divergence}, Phy.\ Rev.\ A, \textbf{79}, (2008).) A somewhat more intricate argument also establishes metric properties of Jensen-Rényi divergences (for $α\in (0,1)$), and outlines a technique that may be of independent interest.

Submitted 15 December, 2019; v1 submitted 6 November, 2019; originally announced November 2019.

Comments: 10 pages; reorganized presentation; added new section

MSC Class: 15A45; 52A99; 47B65; 94A17

arXiv:1910.04194 [pdf, other]

Projection-free nonconvex stochastic optimization on Riemannian manifolds

Authors: Melanie Weber, Suvrit Sra

Abstract: We study stochastic projection-free methods for constrained optimization of smooth functions on Riemannian manifolds, i.e., with additional constraints beyond the parameter domain being a manifold. Specifically, we introduce stochastic Riemannian Frank-Wolfe methods for nonconvex and geodesically convex problems. We present algorithms for both purely stochastic optimization and finite-sum problems. For the latter, we develop variance-reduced methods, including a Riemannian adaptation of the recently proposed Spider technique. For all settings, we recover convergence rates that are comparable to the best-known rates for their Euclidean counterparts. Finally, we discuss applications to two classic tasks: the computation of the Karcher mean of positive definite matrices and Wasserstein barycenters for multivariate normal distributions. For both tasks, stochastic FW methods yield state-of-the-art empirical performance.

Submitted 3 April, 2021; v1 submitted 9 October, 2019; originally announced October 2019.

Comments: Under Review

arXiv:1907.09350

Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

Authors: Tiancheng Yu, Suvrit Sra

Abstract: A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of \emph{adversarial} bandit problems. ARL achieves $O(\sqrt{SATH^3})$ regret, which is optimal with respect to $S$, $A$, and $T$, and its dependence on $H$ is the best (even for the usual stationary MDP) among existing model-free methods.

Submitted 21 August, 2019; v1 submitted 22 July, 2019; originally announced July 2019.

Comments: There is a problem in the Theorem 1. We will try to fix it and update a new version

arXiv:1907.03922 [pdf, ps, other]

Are deep ResNets provably better than linear predictors?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start by two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows under simple geometric conditions that, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

Submitted 29 October, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

Comments: 15 pages. NeurIPS 2019 Camera-ready version

arXiv:1906.11289

Near Optimal Stratified Sampling

Authors: Tiancheng Yu, Xiyu Zhai, Suvrit Sra

Abstract: The performance of a machine learning system is usually evaluated by using i.i.d.\ observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show the proposed algorithms (to log-factors) are rate optimal. Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.

Submitted 26 July, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

Comments: We have discovered a mistake in the main result. The quantity on the RHS of (3) is not equal to the variance of estimator (2) when the sampling rule is designed adaptively as we do. There will be further cross-product terms which are now dominant terms. Therefore, although our bound is correct for (3), it no longer implies bound of the variance of (2)

Showing 1–50 of 119 results for author: Suvrit