Showing 1–50 of 83 results for author: Sra, S

Searching in archive cs.
  1. arXiv:2406.12771  [pdf, other]

    math.OC cs.LG

    First-Order Methods for Linearly Constrained Bilevel Optimization

    Authors: Guy Kornowski, Swati Padmanabhan, Kai Wang, Zhe Zhang, Suvrit Sra

    Abstract: Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality cons…

    Submitted 18 June, 2024; originally announced June 2024.

  2. arXiv:2405.15816  [pdf, other]

    math.OC cs.AI cs.LG

    Riemannian Bilevel Optimization

    Authors: Sanchayan Dutta, Xiang Cheng, Suvrit Sra

    Abstract: We develop new algorithms for Riemannian bilevel optimization. We focus in particular on batch and stochastic gradient-based methods, with the explicit goal of avoiding second-order information such as Riemannian hyper-gradients. We propose and analyze $\mathrm{RF^2SA}$, a method that leverages first-order gradient information to navigate the complex geometry of Riemannian manifolds efficiently. N…

    Submitted 22 May, 2024; originally announced May 2024.

  3. arXiv:2402.10357  [pdf, other]

    math.ST cs.LG math.PR stat.CO stat.ML

    Efficient Sampling on Riemannian Manifolds via Langevin MCMC

    Authors: Xiang Cheng, Jingzhao Zhang, Suvrit Sra

    Abstract: We study the task of efficiently sampling from a Gibbs distribution $d π^* = e^{-h} d {vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice. The key to our analysis of Langevin MCMC is a bound on the discretization error of the geometric Euler-Maruyama sche…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: This is an old paper from NeurIPS 2022. arXiv admin note: text overlap with arXiv:2204.13665

  4. arXiv:2312.06528  [pdf, other]

    cs.LG

    Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

    Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra

    Abstract: Many neural network architectures are known to be Turing Complete, and can thus, in principle implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. This paper provides theoretical and empirical evidence that (non-linear) Transformers naturally learn to implement gradient descent in funct…

    Submitted 3 June, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

  5. arXiv:2310.01082  [pdf, other]

    cs.LG cs.AI math.OC

    Linear attention is (maybe) all you need (to understand transformer optimization)

    Authors: Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

    Abstract: Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and…

    Submitted 13 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Published at ICLR 2024

  6. arXiv:2307.04456  [pdf, other]

    math.OC cs.LG

    Invex Programs: First Order Algorithms and Their Convergence

    Authors: Adarsh Barik, Suvrit Sra, Jean Honorio

    Abstract: Invex programs are a special kind of non-convex problems which attain global minima at every stationary point. While classical first-order gradient descent methods can solve them, they converge very slowly. In this paper, we propose new first-order algorithms to solve the general class of invex problems. We identify sufficient conditions for convergence of our algorithms and provide rates of conve…

    Submitted 10 July, 2023; originally announced July 2023.

  7. arXiv:2306.00297  [pdf, other]

    cs.LG cs.AI

    Transformers learn to implement preconditioned gradient descent for in-context learning

    Authors: Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, Suvrit Sra

    Abstract: Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent. Going beyond the question of expressivity, we ask: Can transformers learn to implement such algorithms by training over random problem instance…

    Submitted 9 November, 2023; v1 submitted 31 May, 2023; originally announced June 2023.

    Comments: Improved presentation and added new results for the nonlinear activation case; 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

    Journal ref: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  8. arXiv:2305.15659  [pdf, other]

    cs.LG cs.AI math.OC

    How to escape sharp minima with random perturbations

    Authors: Kwangjun Ahn, Ali Jadbabaie, Suvrit Sra

    Abstract: Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use…

    Submitted 25 May, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted at ICML 2024

  9. arXiv:2305.15287  [pdf, other]

    cs.LG cs.AI stat.ML

    The Crucial Role of Normalization in Sharpness-Aware Minimization

    Authors: Yan Dai, Kwangjun Ahn, Suvrit Sra

    Abstract: Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically an…

    Submitted 23 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: 30 pages, Published in 37th Neural Information Processing Systems (NeurIPS 2023)

  10. arXiv:2302.12444  [pdf, other]

    cs.LG math.OC

    On the Training Instability of Shuffling SGD with Batch Normalization

    Authors: David X. Wu, Chulhee Yun, Suvrit Sra

    Abstract: We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) -- two widely used variants of SGD -- interact surprisingly differently in the presence of batch normalization: RR leads to much more stable evolution of training loss than SS. As a concrete example, for r…

    Submitted 14 August, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: ICML 2023 camera-ready version, added references; 75 pages

  11. arXiv:2212.14511  [pdf, other]

    cs.LG eess.SY math.OC stat.ML

    Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?

    Authors: Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra

    Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particul…

    Submitted 13 March, 2024; v1 submitted 29 December, 2022; originally announced December 2022.

    Comments: 37 pages; Updated structure and proofs

  12. arXiv:2208.05013  [pdf, other]

    math.OC cs.CC cs.DS math.FA

    Computing Brascamp-Lieb Constants through the lens of Thompson Geometry

    Authors: Melanie Weber, Suvrit Sra

    Abstract: This paper studies algorithms for efficiently computing Brascamp-Lieb constants, a task that has recently received much interest. In particular, we reduce the computation to a nonlinear matrix-valued iteration, whose convergence we analyze through the lens of fixed-point methods under the well-known Thompson metric. This approach permits us to obtain (weakly) polynomial time guarantees, and it off…

    Submitted 14 April, 2024; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Under Review

    MSC Class: 46N10; 49Q99; 53Z50; 68W40

  13. arXiv:2206.11426  [pdf, other]

    math.OC cs.LG

    On a class of geodesically convex optimization problems solved via Euclidean MM methods

    Authors: Melanie Weber, Suvrit Sra

    Abstract: We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality…

    Submitted 20 October, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

    Comments: Under Review

  14. arXiv:2204.01050  [pdf, ps, other]

    math.OC cs.LG

    Understanding the unstable convergence of gradient descent

    Authors: Kwangjun Ahn, Jingzhao Zhang, Suvrit Sra

    Abstract: Most existing analyses of (stochastic) gradient descent rely on the condition that for $L$-smooth costs, the step size is less than $2/L$. However, many works have observed that in machine learning applications step sizes often do not fulfill this condition, yet (stochastic) gradient descent still converges, albeit in an unstable manner. We investigate this unstable convergence phenomenon from fir…

    Submitted 9 June, 2022; v1 submitted 3 April, 2022; originally announced April 2022.

    Comments: Accepted to the 39th International Conference on Machine Learning (ICML 2022), Baltimore, Maryland, USA. Version 2 improves writing and presentation, adds discussion regarding concurrent works

  15. arXiv:2202.13013  [pdf, other]

    cs.LG stat.ML

    Sign and Basis Invariant Networks for Spectral Graph Representation Learning

    Authors: Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, Stefanie Jegelka

    Abstract: We introduce SignNet and BasisNet -- new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if $v$ is an eigenvector then so is $-v$; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors. We prove that under certain conditions our networks are universal, i…

    Submitted 30 September, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

    Comments: 42 pages

  16. arXiv:2202.06950  [pdf, other]

    math.OC cs.LG stat.ML

    Sion's Minimax Theorem in Geodesic Metric Spaces and a Riemannian Extragradient Algorithm

    Authors: Peiyuan Zhang, Jingzhao Zhang, Suvrit Sra

    Abstract: Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. This paper takes a step towards understanding a broad class of nonconvex-nonconcave minimax problems that do remain tractable. Specifically, it studies minimax problems over geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems.…

    Submitted 28 May, 2023; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: 23 pages, 3 figures

  17. arXiv:2112.11450  [pdf, other]

    cs.LG cs.AI cs.CV

    Max-Margin Contrastive Learning

    Authors: Anshul Shah, Suvrit Sra, Rama Chellappa, Anoop Cherian

    Abstract: Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learni…

    Submitted 21 December, 2021; originally announced December 2021.

    Comments: Accepted at AAAI 2022

  18. arXiv:2111.02763  [pdf, ps, other]

    math.OC cs.LG stat.ML

    Understanding Riemannian Acceleration via a Proximal Extragradient Framework

    Authors: Jikai Jin, Suvrit Sra

    Abstract: We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient (A-HPE), a powerful framework for obtaining Euclidean accelerated methods (Monteiro and Svaiter, 2013). Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of ne…

    Submitted 9 February, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

  19. arXiv:2110.10342  [pdf, other]

    cs.LG math.OC stat.ML

    Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

    Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra

    Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients…

    Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: ICLR 2022 camera-ready (selected for an oral presentation); 76 pages, 3 figures

  20. arXiv:2110.06256  [pdf, other]

    cs.LG math.OC stat.ML

    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

    Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie

    Abstract: This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the…

    Submitted 17 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Journal ref: ICML 2022

  21. arXiv:2106.11230  [pdf, other]

    cs.LG

    Can contrastive learning avoid shortcut solutions?

    Authors: Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, Suvrit Sra

    Abstract: The generalization of representations learned via contrastive learning depends crucially on what features of the data are extracted. However, we observe that the contrastive loss does not always sufficiently guide which features are extracted, a behavior that can negatively impact the performance on downstream tasks via "shortcuts", i.e., by inadvertently suppressing important predictive features.…

    Submitted 19 December, 2021; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: NeurIPS 2021

  22. arXiv:2103.07079  [pdf, other]

    cs.LG math.OC

    Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We propose matrix norm inequalities that extend the Recht-Ré (2012) conjecture on a noncommutative AM-GM inequality by supplementing it with another inequality that accounts for single-shuffle, which is a widely used without-replacement sampling scheme that shuffles only once in the beginning and is overlooked in the Recht-Ré conjecture. Instead of general positive semidefinite matrices, we restri…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 26 pages, 2 figures

  23. arXiv:2102.03192  [pdf, other]

    cs.LG

    Provably Efficient Algorithms for Multi-Objective Competitive RL

    Authors: Tiancheng Yu, Yi Tian, Jingzhao Zhang, Suvrit Sra

    Abstract: We study multi-objective reinforcement learning (RL) where an agent's reward is represented as a vector. In settings where an agent competes against opponents, its performance is measured by the distance of its average return vector to a target set. We develop statistically and computationally efficient algorithms to approach the associated target set. Our results extend Blackwell's approachabilit…

    Submitted 5 February, 2021; originally announced February 2021.

  24. arXiv:2012.15483  [pdf, other]

    cs.LG stat.ML

    Why do classifier accuracies show linear trends under distribution shift?

    Authors: Horia Mania, Suvrit Sra

    Abstract: Recent studies of generalization in deep learning have observed a puzzling trend: accuracies of models on one data distribution are approximately linear functions of the accuracies on another distribution. We explain this trend under an intuitive assumption on model similarity, which was verified empirically in prior work. More precisely, we assume the probability that two models agree in their pr…

    Submitted 22 February, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: 18 pages, 13 figures

  25. arXiv:2010.15020  [pdf, other]

    cs.LG stat.ML

    Online Learning in Unknown Markov Games

    Authors: Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra

    Abstract: We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the minimax value of the game…

    Submitted 6 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

    Comments: 25 pages

  26. arXiv:2010.12230  [pdf, other]

    cs.LG cs.CV math.OC

    Coping with Label Shift via Distributionally Robust Optimisation

    Authors: Jingzhao Zhang, Aditya Menon, Andreas Veit, Srinadh Bhojanapalli, Sanjiv Kumar, Suvrit Sra

    Abstract: The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an unlabelled test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, thei…

    Submitted 17 August, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

  27. arXiv:2010.04592  [pdf, other]

    cs.LG stat.ML

    Contrastive Learning with Hard Negative Samples

    Authors: Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, Stefanie Jegelka

    Abstract: How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negati…

    Submitted 24 January, 2021; v1 submitted 9 October, 2020; originally announced October 2020.

    Comments: Published as a conference paper at ICLR 2021

  28. arXiv:2006.13405  [pdf, other]

    cs.LG math.OC stat.ML

    Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

    Authors: Yi Tian, Jian Qian, Suvrit Sra

    Abstract: We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational comp…

    Submitted 23 June, 2020; originally announced June 2020.

    Comments: 54 pages

  29. arXiv:2006.04429  [pdf, other]

    math.OC cs.LG

    Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity

    Authors: Jingzhao Zhang, Hongzhou Lin, Subhro Das, Suvrit Sra, Ali Jadbabaie

    Abstract: We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular,…

    Submitted 17 June, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

    Journal ref: ICML 2022

  30. arXiv:2005.08304  [pdf, ps, other]

    math.OC cs.LG

    Understanding Nesterov's Acceleration via Proximal Point Method

    Authors: Kwangjun Ahn, Suvrit Sra

    Abstract: The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use the PPM method to provide conceptually simple derivations along with convergence analyses of different versions of Nesterov's accelerated gradient method (AGM). The key observation is that AGM is a simple approximation of PPM, wh…

    Submitted 2 June, 2022; v1 submitted 17 May, 2020; originally announced May 2020.

    Comments: 14 pages; Presented at SIAM Symposium on Simplicity in Algorithms (SOSA22), January 10 - 11, 2022

  31. arXiv:2004.08657  [pdf, ps, other]

    math.OC cs.LG

    On Tight Convergence Rates of Without-replacement SGD

    Authors: Kwangjun Ahn, Suvrit Sra

    Abstract: For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD. Denoting by $n$ the number of components in the cost and $K$ the number of epochs of the algorithm, several recent works have shown convergence rates of without-replacement SGD that have better dependency on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for SGD. However, ther…

    Submitted 18 April, 2020; originally announced April 2020.

    Comments: 12 pages

  32. arXiv:2002.08483  [pdf, other]

    cs.LG stat.ML

    Strength from Weakness: Fast Learning Using Weak Supervision

    Authors: Joshua Robinson, Stefanie Jegelka, Suvrit Sra

    Abstract: We study generalization properties of weakly supervised learning. That is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(1/n)$, where $n$ denotes…

    Submitted 19 February, 2020; originally announced February 2020.

    Comments: 21 pages, 8 figures

  33. arXiv:2002.04130  [pdf, other]

    math.OC cs.LG

    Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

    Authors: Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Ali Jadbabaie, Suvrit Sra

    Abstract: We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains examples such as ReLU neural networks and others with non-differentiable activation functions. We fi…

    Submitted 29 June, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

  34. arXiv:1912.03194  [pdf, other]

    math.OC cs.LG

    Why are Adaptive Methods Good for Attention Models?

    Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, Suvrit Sra

    Abstract: While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Clipped SGD/Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to adaptive methods are not well understood yet. In this paper, we provide empirical and theoretical evidence that a h…

    Submitted 23 October, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

  35. arXiv:1912.01192  [pdf, ps, other]

    cs.LG stat.ML

    Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

    Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

    Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of ep…

    Submitted 2 November, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

    Comments: Fix a bug

    MSC Class: I.2.6 ACM Class: I.2.6

  36. arXiv:1911.02643  [pdf, ps, other]

    cs.IT quant-ph

    Metrics Induced by Jensen-Shannon and Related Divergences on Positive Definite Matrices

    Authors: Suvrit Sra

    Abstract: We study metric properties of symmetric divergences on Hermitian positive definite matrices. In particular, we prove that the square root of these divergences is a distance metric. As a corollary we obtain a proof of the metric property for Quantum Jensen-Shannon-(Tsallis) divergences (parameterized by $α\in [0,2]$), which in turn (for $α=1$) yields a proof of the metric property of the Quantum Je…

    Submitted 15 December, 2019; v1 submitted 6 November, 2019; originally announced November 2019.

    Comments: 10 pages; reorganized presentation; added new section

    MSC Class: 15A45; 52A99; 47B65; 94A17

  37. arXiv:1910.04194  [pdf, other]

    math.OC cs.LG

    Projection-free nonconvex stochastic optimization on Riemannian manifolds

    Authors: Melanie Weber, Suvrit Sra

    Abstract: We study stochastic projection-free methods for constrained optimization of smooth functions on Riemannian manifolds, i.e., with additional constraints beyond the parameter domain being a manifold. Specifically, we introduce stochastic Riemannian Frank-Wolfe methods for nonconvex and geodesically convex problems. We present algorithms for both purely stochastic optimization and finite-sum problems…

    Submitted 3 April, 2021; v1 submitted 9 October, 2019; originally announced October 2019.

    Comments: Under Review

  38. arXiv:1907.09350   

    cs.LG stat.ML

    Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

    Authors: Tiancheng Yu, Suvrit Sra

    Abstract: A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy of this potentially adversarial MDP by dev…

    Submitted 21 August, 2019; v1 submitted 22 July, 2019; originally announced July 2019.

    Comments: There is a problem in the Theorem 1. We will try to fix it and update a new version

  39. arXiv:1907.03922  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Are deep ResNets provably better than linear predictors?

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residu…

    Submitted 29 October, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

    Comments: 15 pages. NeurIPS 2019 Camera-ready version

  40. arXiv:1906.11289   

    cs.LG stat.ML

    Near Optimal Stratified Sampling

    Authors: Tiancheng Yu, Xiyu Zhai, Suvrit Sra

    Abstract: The performance of a machine learning system is usually evaluated by using i.i.d. observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits sta…

    Submitted 26 July, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: We have discovered a mistake in the main result. The quantity on the RHS of (3) is not equal to the variance of estimator (2) when the sampling rule is designed adaptively as we do. There will be further cross-product terms which are now dominant terms. Therefore, although our bound is correct for (3), it no longer implies bound of the variance of (2)

  41. arXiv:1906.05413  [pdf, other]

    cs.LG stat.ML

    Flexible Modeling of Diversity with Strongly Log-Concave Distributions

    Authors: Joshua Robinson, Suvrit Sra, Stefanie Jegelka

    Abstract: Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right e…

    Submitted 12 June, 2019; originally announced June 2019.

  42. arXiv:1905.11881  [pdf, other]

    math.OC cs.LG

    Why gradient clipping accelerates training: A theoretical justification for adaptivity

    Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

    Abstract: We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant varia…

    Submitted 10 February, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

  43. arXiv:1901.09149  [pdf, other

    cs.LG math.OC stat.ML

    Escaping Saddle Points with Adaptive Gradient Methods

    Authors: Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

    Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own,…

    Submitted 3 February, 2020; v1 submitted 25 January, 2019; originally announced January 2019.

    Comments: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. matrix Freedman

  44. arXiv:1812.03190  [pdf, other

    cs.LG stat.ML

    Deep-RBF Networks Revisited: Robust Classification with Rejection

    Authors: Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra

    Abstract: One of the main drawbacks of deep neural networks, like many other classifiers, is their vulnerability to adversarial attacks. An important reason for their vulnerability is assigning high confidence to regions with few or even no feature points. By feature points, we mean a nonlinear transformation of the input space extracting a meaningful representation of the input data. On the other hand, dee…

    Submitted 7 December, 2018; originally announced December 2018.

  45. arXiv:1811.04194  [pdf, other

    math.OC cs.LG

    R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate

    Authors: Jingzhao Zhang, Hongyi Zhang, Suvrit Sra

    Abstract: We study smooth stochastic optimization problems on Riemannian manifolds. Via adapting the recently proposed SPIDER algorithm \citep{fang2018spider} (a variance reduced stochastic method) to Riemannian manifold, we can achieve faster rate than known algorithms in both the finite sum and stochastic settings. Unlike previous works, by \emph{not} resorting to bounding iterate distances, our analysis…

    Submitted 14 December, 2018; v1 submitted 9 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1605.07147

  46. arXiv:1810.07770  [pdf, ps, other

    cs.LG stat.ML

    Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $Ω(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $Θ(\sqrt{N})$ is necessary and sufficient for mem…

    Submitted 29 October, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

    Comments: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version

  47. arXiv:1809.10858  [pdf, ps, other

    math.OC cs.LG stat.ML

    Efficiently testing local optimality and escaping saddles for ReLU networks

    Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

    Abstract: We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into a…

    Submitted 28 May, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

    Comments: 23 pages, appeared at ICLR 2019

  48. arXiv:1806.02812  [pdf, other

    math.OC cs.LG

    Towards Riemannian Accelerated Gradient Methods

    Authors: Hongyi Zhang, Suvrit Sra

    Abstract: We propose a Riemannian version of Nesterov's Accelerated Gradient algorithm (RAGD), and show that for geodesically smooth and strongly convex problems, within a neighborhood of the minimizer whose radius depends on the condition number as well as the sectional curvature of the manifold, RAGD converges to the minimizer with acceleration. Unlike the algorithm in (Liu et al., 2017) that requires the…

    Submitted 7 June, 2018; originally announced June 2018.

    Comments: Published in the 31st Annual Conference on Learning Theory (COLT'18)

  49. arXiv:1805.00521  [pdf, other

    math.OC cs.LG stat.ML

    Direct Runge-Kutta Discretization Achieves Acceleration

    Authors: Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie

    Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lip…

    Submitted 27 November, 2018; v1 submitted 1 May, 2018; originally announced May 2018.

    Comments: 24 pages. 4 figures

  50. arXiv:1803.11064  [pdf, other

    cs.CV

    Non-Linear Temporal Subspace Representations for Activity Recognition

    Authors: Anoop Cherian, Suvrit Sra, Stephen Gould, Richard Hartley

    Abstract: Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action…

    Submitted 27 March, 2018; originally announced March 2018.

    Comments: Accepted at the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, 2018. arXiv admin note: substantial text overlap with arXiv:1705.08583