Search | arXiv e-print repository

Efficient Sampling on Riemannian Manifolds via Langevin MCMC

Authors: Xiang Cheng, Jingzhao Zhang, Suvrit Sra

Abstract: We study the task of efficiently sampling from a Gibbs distribution $d π^* = e^{-h} d {vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is efficiently implementable in practice. The key to our analysis of Langevin MCMC is a bound on the discretization error of the geometric Euler-Maruyama scheme, assuming $\nabla h$ is Lipschitz and $M$ has bounded sectional curvature. Our error bound matches the error of Euclidean Euler-Maruyama in terms of its stepsize dependence. Combined with a contraction guarantee for the geometric Langevin Diffusion under Kendall-Cranston coupling, we prove that the Langevin MCMC iterates lie within $ε$-Wasserstein distance of $π^*$ after $\tilde{O}(ε^{-2})$ steps, which matches the iteration complexity for Euclidean Langevin MCMC. Our results apply in general settings where $h$ can be nonconvex and $M$ can have negative Ricci curvature. Under additional assumptions that the Riemannian curvature tensor has bounded derivatives, and that $π^*$ satisfies a $CD(\cdot,\infty)$ condition, we analyze the stochastic gradient version of Langevin MCMC, and bound its iteration complexity by $\tilde{O}(ε^{-2})$ as well.

Submitted 15 February, 2024; originally announced February 2024.

Comments: This is an old paper from NeurIPS 2022. arXiv admin note: text overlap with arXiv:2204.13665
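
Illustration (not from the paper): the abstract above describes a step that moves along the exponential map in a random Gaussian tangent direction. A minimal sketch of such a step, assuming the manifold is the unit sphere (where the exponential map has a closed form) and using hypothetical helper names (`tangent_project`, `sphere_exp`, `geometric_langevin_step`, `run_chain`), might look like this. It is only meant to make the update $x_{k+1} = \mathrm{Exp}_{x_k}(-δ \nabla h(x_k) + \sqrt{2δ}\, ξ_k)$ concrete and is not the authors' implementation.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def tangent_project(x, u):
    """Project an ambient vector u onto the tangent space of the unit sphere at x."""
    c = dot(x, u)
    return [ui - c * xi for ui, xi in zip(u, x)]

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x with initial velocity v."""
    norm_v = math.sqrt(dot(v, v))
    if norm_v < 1e-12:
        return list(x)
    return [math.cos(norm_v) * xi + math.sin(norm_v) * vi / norm_v
            for xi, vi in zip(x, v)]

def geometric_langevin_step(x, grad_h, step, rng):
    """One step: Exp_x(-step * grad h(x) + sqrt(2 * step) * xi), xi Gaussian in the tangent space."""
    g = tangent_project(x, grad_h(x))  # Riemannian gradient on the sphere
    xi = tangent_project(x, [rng.gauss(0.0, 1.0) for _ in x])  # tangent Gaussian noise
    v = [-step * gi + math.sqrt(2.0 * step) * ni for gi, ni in zip(g, xi)]
    return sphere_exp(x, v)

def run_chain(x0, grad_h, step, n_steps, seed=0):
    """Run the chain from a unit vector x0 (assumed already normalized)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        x = geometric_langevin_step(x, grad_h, step, rng)
    return x

# Assumed example potential: h(x) = -2 * <e3, x>, whose Euclidean gradient is constant.
sample = run_chain([1.0, 0.0, 0.0], lambda p: [0.0, 0.0, -2.0], step=0.01, n_steps=200)
```

Because every step goes through the exponential map, the iterates stay on the sphere without any separate re-normalization.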

arXiv:2305.15287 [pdf, other]

The Crucial Role of Normalization in Sharpness-Aware Minimization

Authors: Yan Dai, Kwangjun Ahn, Suvrit Sra

Abstract: Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus, in particular, on understanding the role played by normalization, a key component of the SAM updates. We theoretically and empirically study the effect of normalization in SAM for both convex and non-convex functions, revealing two key roles played by normalization: i) it helps in stabilizing the algorithm; and ii) it enables the algorithm to drift along a continuum (manifold) of minima -- a property identified by recent theoretical works that is the key to better performance. We further argue that these two properties of normalization make SAM robust against the choice of hyper-parameters, supporting the practicality of SAM. Our conclusions are backed by various experiments.

Submitted 23 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 30 pages, Published in 37th Neural Information Processing Systems (NeurIPS 2023)
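
Illustration (not from the paper): the normalization discussed in the abstract above refers to the ascent step of SAM, which perturbs the weights by $ρ \nabla f(x) / \|\nabla f(x)\|$ before taking the descent gradient. A minimal sketch contrasting the normalized and unnormalized variants follows; the function name `sam_step` and the `normalize` flag are hypothetical, and the one-dimensional quadratic is an assumed toy objective.

```python
import math

def sam_step(x, grad, lr, rho, normalize=True):
    """One SAM-style update on a list of floats.

    normalize=True  : perturb by rho * g / ||g||  (the normalized ascent step)
    normalize=False : perturb by rho * g          (unnormalized variant, for comparison)
    """
    g = grad(x)
    if normalize:
        norm = math.sqrt(sum(gi * gi for gi in g))
        scale = rho / norm if norm > 0 else 0.0
    else:
        scale = rho
    x_adv = [xi + scale * gi for xi, gi in zip(x, g)]  # ascent (perturbation) step
    g_adv = grad(x_adv)                                # gradient at the perturbed point
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]  # descent step from the original x

# Assumed toy objective f(x) = x^2 with gradient 2x.
grad_f = lambda x: [2.0 * x[0]]
x_normalized = sam_step([3.0], grad_f, lr=0.1, rho=0.5, normalize=True)
x_unnormalized = sam_step([3.0], grad_f, lr=0.1, rho=0.5, normalize=False)
```

In the normalized variant the perturbation always has length `rho`, whereas in the unnormalized variant its length scales with the gradient norm.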

arXiv:2212.14511 [pdf, other]

Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?

Authors: Yi Tian, Kaiqing Zhang, Russ Tedrake, Suvrit Sra

Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations.

Submitted 13 March, 2024; v1 submitted 29 December, 2022; originally announced December 2022.

Comments: 37 pages; Updated structure and proofs

arXiv:2202.13013 [pdf, other]

Sign and Basis Invariant Networks for Spectral Graph Representation Learning

Authors: Derek Lim, Joshua Robinson, Lingxiao Zhao, Tess Smidt, Suvrit Sra, Haggai Maron, Stefanie Jegelka

Abstract: We introduce SignNet and BasisNet -- new neural architectures that are invariant to two key symmetries displayed by eigenvectors: (i) sign flips, since if $v$ is an eigenvector then so is $-v$; and (ii) more general basis symmetries, which occur in higher dimensional eigenspaces with infinitely many choices of basis eigenvectors. We prove that under certain conditions our networks are universal, i.e., they can approximate any continuous function of eigenvectors with the desired invariances. When used with Laplacian eigenvectors, our networks are provably more expressive than existing spectral methods on graphs; for instance, they subsume all spectral graph convolutions, certain spectral graph invariants, and previously proposed graph positional encodings as special cases. Experiments show that our networks significantly outperform existing baselines on molecular graph regression, learning expressive graph representations, and learning neural fields on triangle meshes. Our code is available at https://github.com/cptq/SignNet-BasisNet .

Submitted 30 September, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

Comments: 42 pages

arXiv:2202.06950 [pdf, other]

Sion's Minimax Theorem in Geodesic Metric Spaces and a Riemannian Extragradient Algorithm

Authors: Peiyuan Zhang, Jingzhao Zhang, Suvrit Sra

Abstract: Deciding whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. This paper takes a step towards understanding a broad class of nonconvex-nonconcave minimax problems that do remain tractable. Specifically, it studies minimax problems over geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and broadly accessible as it relies on the finite intersection property alone. The second main result is a specialization to geodesically complete Riemannian manifolds: here, we devise and analyze the complexity of first-order methods for smooth minimax problems.

Submitted 28 May, 2023; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: 23 pages, 3 figures

arXiv:2112.14862 [pdf, ps, other]

Time varying regression with hidden linear dynamics

Authors: Ali Jadbabaie, Horia Mania, Devavrat Shah, Suvrit Sra

Abstract: We revisit a model for time-varying linear regression that assumes the unknown parameters evolve according to a linear dynamical system. Counterintuitively, we show that when the underlying dynamics are stable the parameters of this model can be estimated from data by combining just two ordinary least squares estimates. We offer a finite sample guarantee on the estimation error of our method and discuss certain advantages it has over Expectation-Maximization (EM), which is the main approach proposed by prior work.

Submitted 29 December, 2021; originally announced December 2021.

Comments: 22 pages

arXiv:2111.02763 [pdf, ps, other]

Understanding Riemannian Acceleration via a Proximal Extragradient Framework

Authors: Jikai Jin, Suvrit Sra

Abstract: We contribute to advancing the understanding of Riemannian accelerated gradient methods. In particular, we revisit Accelerated Hybrid Proximal Extragradient (A-HPE), a powerful framework for obtaining Euclidean accelerated methods \citep{monteiro2013accelerated}. Building on A-HPE, we then propose and analyze Riemannian A-HPE. The core of our analysis consists of two key components: (i) a set of new insights into Euclidean A-HPE itself; and (ii) a careful control of metric distortion caused by Riemannian geometry. We illustrate our framework by obtaining a few existing and new Riemannian accelerated gradient methods as special cases, while characterizing their acceleration as corollaries of our main results.

Submitted 9 February, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2110.10342 [pdf, other]

Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond

Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra

Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.

Submitted 23 March, 2022; v1 submitted 19 October, 2021; originally announced October 2021.

Comments: ICLR 2022 camera-ready (selected for an oral presentation); 76 pages, 3 figures

arXiv:2110.06256 [pdf, other]

Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

Authors: Jingzhao Zhang, Haochuan Li, Suvrit Sra, Ali Jadbabaie

Abstract: This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.

Submitted 17 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Journal ref: ICML 2022

arXiv:2012.15483 [pdf, other]

Why do classifier accuracies show linear trends under distribution shift?

Authors: Horia Mania, Suvrit Sra

Abstract: Recent studies of generalization in deep learning have observed a puzzling trend: accuracies of models on one data distribution are approximately linear functions of the accuracies on another distribution. We explain this trend under an intuitive assumption on model similarity, which was verified empirically in prior work. More precisely, we assume the probability that two models agree in their predictions is higher than what we can infer from their accuracy levels alone. Then, we show that a linear trend must occur when evaluating models on two distributions unless the size of the distribution shift is large. This work emphasizes the value of understanding model similarity, which can have an impact on the generalization and robustness of classification models.

Submitted 22 February, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: 18 pages, 13 figures

arXiv:2010.15020 [pdf, other]

Online Learning in Unknown Markov Games

Authors: Yi Tian, Yuanhao Wang, Tiancheng Yu, Suvrit Sra

Abstract: We study online learning in unknown Markov games, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret by competing with the \emph{minimax value} of the game, and present an algorithm that achieves a sublinear $\tilde{\mathcal{O}}(K^{2/3})$ regret after $K$ episodes. This is the first sublinear regret bound (to our knowledge) for online learning in unknown Markov games. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., (Xie et al., 2020)) by an exponential factor in the number of opponents.

Submitted 6 February, 2021; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 25 pages

arXiv:2010.04592 [pdf, other]

Contrastive Learning with Hard Negative Samples

Authors: Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, Stefanie Jegelka

Abstract: How can you sample good negative examples for contrastive learning? We argue that, as with metric learning, contrastive learning of representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use true similarity information. In response, we develop a new family of unsupervised sampling methods for selecting hard negative samples where the user can control the hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only a few additional lines of code to implement, and introduces no computational overhead.

Submitted 24 January, 2021; v1 submitted 9 October, 2020; originally announced October 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:2006.13405 [pdf, other]

Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes

Authors: Yi Tian, Jian Qian, Suvrit Sra

Abstract: We study minimax optimal reinforcement learning in episodic factored Markov decision processes (FMDPs), which are MDPs with conditionally independent transition components. Assuming the factorization is known, we propose two model-based algorithms. The first one achieves minimax optimal regret guarantees for a rich class of factored structures, while the second one enjoys better computational complexity with a slightly worse regret. A key new ingredient of our algorithms is the design of a bonus term to guide exploration. We complement our algorithms by presenting several structure-dependent lower bounds on regret for FMDPs that reveal the difficulty hiding in the intricacy of the structures.

Submitted 23 June, 2020; originally announced June 2020.

Comments: 54 pages

arXiv:2006.06946 [pdf, other]

SGD with shuffling: optimal rates without component convexity and large epoch requirements

Authors: Kwangjun Ahn, Chulhee Yun, Suvrit Sra

Abstract: We study without-replacement SGD for solving finite-sum optimization problems. Specifically, depending on how the indices of the finite-sum are shuffled, we consider the RandomShuffle (shuffle at the beginning of each epoch) and SingleShuffle (shuffle only once) algorithms. First, we establish minimax optimal convergence rates of these algorithms up to poly-log factors. Notably, our analysis is general enough to cover gradient dominated nonconvex costs, and does not rely on the convexity of individual component functions unlike existing optimal convergence results. Secondly, assuming convexity of the individual components, we further sharpen the tight convergence results for RandomShuffle by removing the drawbacks common to all prior work: the large number of epochs required for the results to hold, and extra poly-log factor gaps to the lower bound.

Submitted 21 June, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 53 pages; supersedes the preprint arXiv:2004.08657; v2 corrects an erroneous claim about SingleShuffle and newly adds Theorem 24 and Appendix F for SingleShuffle
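
Illustration (not from the paper): the two shuffling schemes named in the abstract above differ only in when the permutation of component indices is drawn. A minimal sketch on a scalar finite-sum problem follows; the function name `shuffled_sgd`, the `mode` argument, and the toy quadratic components are assumptions made for the example.

```python
import random

def shuffled_sgd(x0, component_grads, lr, n_epochs, mode="random", seed=0):
    """Without-replacement SGD on a scalar variable.

    mode="random": RandomShuffle, a fresh permutation at the start of every epoch.
    mode="single": SingleShuffle, one permutation drawn once and reused every epoch.
    """
    if mode not in ("random", "single"):
        raise ValueError("mode must be 'random' or 'single'")
    rng = random.Random(seed)
    order = list(range(len(component_grads)))
    rng.shuffle(order)  # initial permutation; SingleShuffle keeps it for all epochs
    x = x0
    for _ in range(n_epochs):
        if mode == "random":
            rng.shuffle(order)  # reshuffle at the beginning of each epoch
        for i in order:  # one full pass, each component used exactly once
            x = x - lr * component_grads[i](x)
    return x

# Assumed toy problem: f_i(x) = 0.5 * (x - a_i)^2, so the minimizer of the sum is mean(a_i) = 2.5.
targets = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, a=a: x - a for a in targets]
x_random = shuffled_sgd(0.0, grads, lr=0.01, n_epochs=500, mode="random")
x_single = shuffled_sgd(0.0, grads, lr=0.01, n_epochs=500, mode="single")
```

With a constant step size both variants settle near the minimizer 2.5 rather than exactly on it; the sketch is meant only to show the sampling schemes, not the rates proved in the paper.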

arXiv:2002.08483 [pdf, other]

Strength from Weakness: Fast Learning Using Weak Supervision

Authors: Joshua Robinson, Stefanie Jegelka, Suvrit Sra

Abstract: We study generalization properties of weakly supervised learning. That is, learning where only a few "strong" labels (the actual target of our prediction) are present but many more "weak" labels are available. In particular, we show that having access to weak labels can significantly accelerate the learning rate for the strong task to the fast rate of $\mathcal{O}(\nicefrac1n)$, where $n$ denotes the number of strongly labeled data points. This acceleration can happen even if by itself the strongly labeled data admits only the slower $\mathcal{O}(\nicefrac{1}{\sqrt{n}})$ rate. The actual acceleration depends continuously on the number of weak labels available, and on the relation between the two tasks. Our theoretical results are reflected empirically across a range of tasks and illustrate how weak labels speed up learning on the strong task.

Submitted 19 February, 2020; originally announced February 2020.

Comments: 21 pages, 8 figures

arXiv:2001.08876 [pdf, other]

From Nesterov's Estimate Sequence to Riemannian Acceleration

Authors: Kwangjun Ahn, Suvrit Sra

Abstract: We propose the first global accelerated gradient method for Riemannian manifolds. Toward establishing our result we revisit Nesterov's estimate sequence technique and develop an alternative analysis for it that may also be of independent interest. Then, we extend this analysis to the Riemannian setting, localizing the key difficulty due to non-Euclidean structure into a certain "metric distortion." We control this distortion by developing a novel geometric inequality, which permits us to propose and analyze a Riemannian counterpart to Nesterov's accelerated gradient method.

Submitted 23 January, 2020; originally announced January 2020.

Comments: 30 pages

arXiv:1912.01192 [pdf, ps, other]

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Authors: Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, Tiancheng Yu

Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.

Submitted 2 November, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: Fix a bug

MSC Class: I.2.6 ACM Class: I.2.6

arXiv:1907.09350

Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

Authors: Tiancheng Yu, Suvrit Sra

Abstract: A Markov Decision Process (MDP) is a popular model for reinforcement learning. However, its commonly used assumption of stationary dynamics and rewards is too stringent and fails to hold in adversarial, nonstationary, or multi-agent problems. We study an episodic setting where the parameters of an MDP can differ across episodes. We learn a reliable policy for this potentially adversarial MDP by developing an Adversarial Reinforcement Learning (ARL) algorithm that reduces our MDP to a sequence of \emph{adversarial} bandit problems. ARL achieves $O(\sqrt{SATH^3})$ regret, which is optimal with respect to $S$, $A$, and $T$, and its dependence on $H$ is the best (even for the usual stationary MDP) among existing model-free methods.

Submitted 21 August, 2019; v1 submitted 22 July, 2019; originally announced July 2019.

Comments: There is a problem in the Theorem 1. We will try to fix it and update a new version

arXiv:1907.03922 [pdf, ps, other]

Are deep ResNets provably better than linear predictors?

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet does not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows that, under simple geometric conditions, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

Submitted 29 October, 2019; v1 submitted 8 July, 2019; originally announced July 2019.

Comments: 15 pages. NeurIPS 2019 Camera-ready version

arXiv:1906.11289

Near Optimal Stratified Sampling

Authors: Tiancheng Yu, Xiyu Zhai, Suvrit Sra

Abstract: The performance of a machine learning system is usually evaluated by using i.i.d.\ observations with true labels. However, acquiring ground truth labels is expensive, while obtaining unlabeled samples may be cheaper. Stratified sampling can be beneficial in such settings and can reduce the number of true labels required without compromising the evaluation accuracy. Stratified sampling exploits statistical properties (e.g., variance) across strata of the unlabeled population, though usually under the unrealistic assumption that these properties are known. We propose two new algorithms that simultaneously estimate these properties and optimize the evaluation accuracy. We construct a lower bound to show the proposed algorithms are rate optimal (up to log factors). Experiments on synthetic and real data show the reduction in label complexity that is enabled by our algorithms.

Submitted 26 July, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

Comments: We have discovered a mistake in the main result. The quantity on the RHS of (3) is not equal to the variance of estimator (2) when the sampling rule is designed adaptively as we do. There will be further cross-product terms, which are now the dominant terms. Therefore, although our bound is correct for (3), it no longer implies a bound on the variance of (2).

arXiv:1906.05413 [pdf, other]

Flexible Modeling of Diversity with Strongly Log-Concave Distributions

Authors: Joshua Robinson, Suvrit Sra, Stefanie Jegelka

Abstract: Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.

Submitted 12 June, 2019; originally announced June 2019.

arXiv:1901.09149 [pdf, other]

Escaping Saddle Points with Adaptive Gradient Methods

Authors: Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.

Submitted 3 February, 2020; v1 submitted 25 January, 2019; originally announced January 2019.

Comments: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. matrix Freedman

arXiv:1812.03190 [pdf, other]

Deep-RBF Networks Revisited: Robust Classification with Rejection

Authors: Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra

Abstract: One of the main drawbacks of deep neural networks, like many other classifiers, is their vulnerability to adversarial attacks. An important reason for their vulnerability is assigning high confidence to regions with few or even no feature points. By feature points, we mean a nonlinear transformation of the input space extracting a meaningful representation of the input data. On the other hand, deep-RBF networks assign high confidence only to the regions containing enough feature points, but they have been discounted due to the widely-held belief that they have the vanishing gradient problem. In this paper, we revisit the deep-RBF networks by first giving a general formulation for them, and then proposing a family of cost functions thereof inspired by metric learning. In the proposed deep-RBF learning algorithm, the vanishing gradient problem does not occur. We make these networks robust to adversarial attack by adding the reject option to their output layer. Through several experiments on the MNIST dataset, we demonstrate that our proposed method not only achieves significant classification accuracy but is also very resistant to various adversarial attacks.

Submitted 7 December, 2018; originally announced December 2018.

arXiv:1810.07770 [pdf, ps, other]

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $Ω(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $Θ(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = Ω(N)$. Combined with a recent upper bound $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.

Submitted 29 October, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version

arXiv:1809.10858 [pdf, ps, other]

Efficiently testing local optimality and escaping saddles for ReLU networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into at most $2^M$ regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, $O(M)$ (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either the benign case or the bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.

Submitted 28 May, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

Comments: 23 pages, appeared at ICLR 2019

arXiv:1806.10077 [pdf, other]

Random Shuffling Beats SGD after Finite Epochs

Authors: Jeff Z. HaoChen, Suvrit Sra

Abstract: A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.

Submitted 7 October, 2019; v1 submitted 26 June, 2018; originally announced June 2018.
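
The difference between the two sampling schemes in the abstract above ("Random Shuffling Beats SGD after Finite Epochs") can be sketched on a toy problem. The following is a minimal illustration, not the paper's experimental setup: the least-squares objective, the step-size schedule 1/(L + 0.1 t), and the helper names `make_least_squares`, `epoch_indices`, and `run` are all assumptions made for this example.

```python
import numpy as np


def make_least_squares(n=50, d=5, noise=0.1, seed=0):
    """Toy strongly convex finite sum: f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + noise * rng.normal(size=n)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # minimizer of f
    return A, b, x_star


def epoch_indices(n, rng, shuffle):
    """Indices visited in one epoch of n steps."""
    if shuffle:
        # RandomShuffle: a fresh permutation, every component used exactly once.
        return rng.permutation(n)
    # Plain SGD: n independent draws with replacement.
    return rng.integers(0, n, size=n)


def run(A, b, epochs=50, shuffle=True, seed=1):
    n, d = A.shape
    rng = np.random.default_rng(seed)
    L = np.max(np.sum(A**2, axis=1))  # largest component smoothness constant
    x = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in epoch_indices(n, rng, shuffle):
            t += 1
            step = 1.0 / (L + 0.1 * t)  # assumed decaying schedule, not from the paper
            x = x - step * (A[i] @ x - b[i]) * A[i]
    return x


A, b, x_star = make_least_squares()
for name, shuffle in [("SGD (with replacement)", False), ("RandomShuffle", True)]:
    err = np.linalg.norm(run(A, b, shuffle=shuffle) - x_star)
    print(f"{name}: distance to optimum = {err:.3e}")
```

The only difference between the two runs is `epoch_indices`; any gap in the printed distances on a single seed is anecdotal and should not be read as a check of the stated rates.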

arXiv:1805.00521 [pdf, other]

Direct Runge-Kutta Discretization Achieves Acceleration

Authors: Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie

Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

Submitted 27 November, 2018; v1 submitted 1 May, 2018; originally announced May 2018.

Comments: 24 pages. 4 figures
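
To make the rate in the abstract above ("Direct Runge-Kutta Discretization Achieves Acceleration") concrete, the exponent $2\frac{s}{s+1}$ can be evaluated for a few integrator orders. This is plain arithmetic on the stated formula; the particular values of $s$ are chosen only for illustration.

```latex
\[
  \mathcal{O}\!\left(N^{-2\frac{s}{s+1}}\right):\qquad
  s=1 \;\Rightarrow\; N^{-1},\qquad
  s=2 \;\Rightarrow\; N^{-4/3},\qquad
  s=4 \;\Rightarrow\; N^{-8/5},\qquad
  s\to\infty \;\Rightarrow\; N^{-2}.
\]
```

The exponent increases with $s$ and approaches, but never reaches, the $N^{-2}$ benchmark that the abstract says can be beaten only under the additional flatness condition.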

arXiv:1802.03487 [pdf, ps, other]

Small nonlinearities in activation functions create bad local minima in neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.

Submitted 28 May, 2019; v1 submitted 9 February, 2018; originally announced February 2018.

Comments: 33 pages, appeared at ICLR 2019

arXiv:1707.02444 [pdf, ps, other]

Global optimality conditions for deep neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We study the error landscape of deep linear and nonlinear neural networks with the squared error loss. Minimizing the loss of a deep linear neural network is a nonconvex problem, and despite recent progress, our understanding of this loss surface is still incomplete. For deep linear networks, we present necessary and sufficient conditions for a critical point of the risk function to be a global minimum. Surprisingly, our conditions provide an efficiently checkable test for global optimality, while such tests are typically intractable in nonconvex optimization. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality, albeit in a more limited function space setting.

Submitted 24 March, 2018; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: 14 pages. A camera-ready version that will appear at ICLR 2018

arXiv:1706.03267 [pdf, other]

An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

Authors: Reshad Hosseini, Suvrit Sra

Abstract: We consider maximum likelihood estimation for Gaussian Mixture Models (GMMs). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast GMM parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.

Submitted 10 June, 2017; originally announced June 2017.

Comments: 21 pages, 6 figures

arXiv:1703.02674 [pdf, other]

Polynomial Time Algorithms for Dual Volume Sampling

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: We study dual volume sampling, a method for selecting k columns from an n x m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.

Submitted 15 November, 2017; v1 submitted 7 March, 2017; originally announced March 2017.

arXiv:1608.01008 [pdf, other]

Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: We study probability measures induced by set functions with constraints. Such measures arise in a variety of real-world settings, where prior knowledge, resource limitations, or other pragmatic considerations impose constraints. We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them. Our first main result is for MCMC sampling from Strongly Rayleigh (SR) measures, for which we present sharp polynomial bounds on the mixing time. As a corollary, this result yields a fast mixing sampler for Determinantal Point Processes (DPPs), yielding (to our knowledge) the first provably fast MCMC sampler for DPPs since their inception over four decades ago. Beyond SR measures, we develop MCMC samplers for probabilistic models with hard constraints and identify sufficient conditions under which their chains mix rapidly. We illustrate our claims by empirically verifying the dependence of mixing times on the key factors governing our theoretical bounds.

Submitted 8 January, 2017; v1 submitted 2 August, 2016; originally announced August 2016.

Comments: The present version subsumes arXiv:1607.03559

arXiv:1607.08254 [pdf, other]

Stochastic Frank-Wolfe Methods for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For objective functions that decompose into a finite-sum, we leverage ideas from variance reduction techniques for convex optimization to obtain new variance reduced nonconvex Frank-Wolfe methods that have provably faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.

Submitted 29 July, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

arXiv:1607.05002 [pdf, ps, other]

Geometric Mean Metric Learning

Authors: Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra

Abstract: We revisit the task of learning a Euclidean metric from data. We approach this problem from first principles and formulate it as a surprisingly simple optimization problem. Indeed, our formulation even admits a closed form solution. This solution possesses several very attractive properties: (i) an innate geometric appeal through the Riemannian geometry of positive definite matrices; (ii) ease of interpretability; and (iii) computational speed several orders of magnitude faster than the widely used LMNN and ITML methods. Furthermore, on standard benchmark datasets, our closed-form solution consistently attains higher classification accuracy.

Submitted 18 July, 2016; originally announced July 2016.

Comments: 7 pages, 4 figures
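
The abstract above ("Geometric Mean Metric Learning") refers to the Riemannian geometry of positive definite matrices but does not state the closed-form solution. As a hedged illustration of the object the title names, the sketch below computes the geometric mean of two symmetric positive definite matrices, A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}. How the paper builds its metric from such a mean is not shown here, and the helper names `spd_power` and `geometric_mean` are hypothetical.

```python
import numpy as np


def spd_power(M, p):
    """Real power of a symmetric positive definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T


def geometric_mean(A, B):
    """Matrix geometric mean A # B of two symmetric positive definite matrices."""
    A_half = spd_power(A, 0.5)
    A_inv_half = spd_power(A, -0.5)
    middle = A_inv_half @ B @ A_inv_half
    return A_half @ spd_power(middle, 0.5) @ A_half


rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A = X @ X.T + np.eye(4)  # random SPD test matrices
B = Y @ Y.T + np.eye(4)
G = geometric_mean(A, B)

# A # B is the unique SPD solution of the Riccati equation G A^{-1} G = B.
print(np.allclose(G @ np.linalg.solve(A, G), B))
```

For commuting inputs the mean reduces to the familiar scalar one: with A = 4I and B = 9I the result is 6I.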

arXiv:1607.03559 [pdf, other]

Fast Sampling for Strongly Rayleigh Measures with Application to Determinantal Point Processes

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: In this note we consider sampling from (non-homogeneous) strongly Rayleigh probability measures. As an important corollary, we obtain a fast mixing Markov Chain sampler for Determinantal Point Processes.

Submitted 12 July, 2016; originally announced July 2016.

arXiv:1605.08374 [pdf, other]

Kronecker Determinantal Point Processes

Authors: Zelda Mariet, Suvrit Sra

Abstract: Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of $N$ items. They have recently gained prominence in several applications that rely on "diverse" subsets. However, their applicability to large problems is still limited due to the $\mathcal O(N^3)$ complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDPP, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDPP.

Submitted 26 May, 2016; originally announced May 2016.
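
One way to see why the Kronecker structure in the abstract above ("Kronecker Determinantal Point Processes") helps: standard exact DPP samplers start from an eigendecomposition of the kernel, and the eigendecomposition of a Kronecker product can be assembled from those of its factors. The sketch below checks that identity numerically. It assumes the eigendecomposition is the bottleneck the abstract refers to; the helper names `random_psd` and `kron_eigh` are hypothetical, and the sampler itself is not shown.

```python
import numpy as np


def random_psd(n, rng):
    """Random symmetric positive semidefinite matrix, used only as test data."""
    X = rng.normal(size=(n, n))
    return X @ X.T


def kron_eigh(L1, L2):
    """Eigenpairs of kron(L1, L2) built from the factors' eigendecompositions.

    Cost is dominated by eigh on the small factors instead of on the full
    (n1 * n2) x (n1 * n2) kernel.
    """
    w1, V1 = np.linalg.eigh(L1)
    w2, V2 = np.linalg.eigh(L2)
    # Eigenvalues are all pairwise products; eigenvectors are Kronecker products.
    return np.kron(w1, w2), np.kron(V1, V2)


rng = np.random.default_rng(0)
L1, L2 = random_psd(3, rng), random_psd(4, rng)
w, V = kron_eigh(L1, L2)

# Reassemble the full 12 x 12 kernel from the factor-level eigenpairs.
print(np.allclose(V @ np.diag(w) @ V.T, np.kron(L1, L2)))
```

The full kernel is formed here only to verify the identity; a practical sampler would avoid materializing it.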

arXiv:1605.06900 [pdf, other]

Fast Stochastic Methods for Nonsmooth Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we show provably faster convergence than batch proximal gradient descent. Finally, we prove a global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions that subsumes several recent works. This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.

Submitted 23 May, 2016; originally announced May 2016.

arXiv:1605.00316 [pdf, other]

Directional Statistics in Machine Learning: a Brief Review

Authors: Suvrit Sra

Abstract: The modern data analyst must cope with data encoded in various forms: vectors, matrices, strings, graphs, or more. Consequently, statistical and machine learning models tailored to different data encodings are important. We focus on data encoded as normalized vectors, so that their "direction" is more important than their magnitude. Specifically, we consider high-dimensional vectors that lie either on the surface of the unit hypersphere or on the real projective plane. For such data, we briefly review common mathematical models prevalent in machine learning, while also outlining some technical aspects, software, applications, and open mathematical challenges.

Submitted 1 May, 2016; originally announced May 2016.

Comments: 12 pages, slightly modified version of submitted book chapter

arXiv:1604.02027 [pdf, other]

Combinatorial Topic Models using Small-Variance Asymptotics

Authors: Ke Jiang, Suvrit Sra, Brian Kulis

Abstract: Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and propose a new objective function derived from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are competitive with popular LDA-based topic modeling approaches, and also discuss the (dis)similarities between our approach and its probabilistic counterparts.

Submitted 26 May, 2016; v1 submitted 7 April, 2016; originally announced April 2016.

Comments: 19 pages

arXiv:1603.06160 [pdf, other]

Stochastic Variance Reduction for Nonconvex Optimization

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.

Submitted 4 April, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

Comments: Minor feedback changes

arXiv:1603.06159 [pdf, other]

Fast Incremental Method for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss a special class of nonconvex problems, due to Polyak, for which SAGA converges at a linear rate to the global optimum. Finally, we analyze the practically valuable regularized and minibatch variants of SAGA. To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems.

Submitted 19 March, 2016; originally announced March 2016.

arXiv:1602.06053 [pdf, other]

First-order Methods for Geodesically Convex Optimization

Authors: Hongyi Zhang, Suvrit Sra

Abstract: Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces. But unlike convex optimization, geodesically convex (g-convex) optimization is much less developed. In this paper we contribute to the understanding of g-convex optimization by developing iteration complexity analysis for several first-order algorithms on Hadamard manifolds. Specifically, we prove upper bounds for the global complexity of deterministic and stochastic (sub)gradient methods for optimizing smooth and nonsmooth g-convex functions, both with and without strong g-convexity. Our analysis also reveals how the manifold geometry, especially \emph{sectional curvature}, impacts convergence rates. To the best of our knowledge, our work is the first to provide global complexity analysis for first-order algorithms for general g-convex optimization.

Submitted 19 February, 2016; originally announced February 2016.

Comments: 21 pages

arXiv:1512.01904 [pdf, other]

Gauss quadrature for matrix inverse forms with applications

Authors: Chengtao Li, Suvrit Sra, Stefanie Jegelka

Abstract: We present a framework for accelerating a spectrum of machine learning algorithms that require computation of bilinear inverse forms $u^\top A^{-1}u$, where $A$ is a positive definite matrix and $u$ a given vector. Our framework is built on Gauss-type quadrature and easily scales to large, sparse matrices. Further, it allows retrospective computation of lower and upper bounds on $u^\top A^{-1}u$, which in turn accelerates several algorithms. We prove that these bounds tighten iteratively and converge at a linear (geometric) rate. To our knowledge, ours is the first work to demonstrate these key properties of Gauss-type quadrature, which is a classical and deeply studied topic. We illustrate empirical consequences of our results by using quadrature to accelerate machine learning tasks involving determinantal point processes and submodular optimization, and observe tremendous speedups in several instances.

Submitted 28 May, 2016; v1 submitted 6 December, 2015; originally announced December 2015.

arXiv:1508.05003 [pdf, other]

AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization

Authors: Suvrit Sra, Adams Wei Yu, Mu Li, Alexander J. Smola

Abstract: We study distributed stochastic convex optimization under the delayed gradient model where the server nodes perform parameter updates, while the worker nodes compute stochastic gradients. We discuss, analyze, and experiment with a setup motivated by the behavior of real-world distributed computation networks, where the machines are differently slow at different times. Therefore, we allow the parameter updates to be sensitive to the actual delays experienced, rather than to worst-case bounds on the maximum delay. This sensitivity leads to larger stepsizes that can help gain rapid initial convergence without having to wait too long for slower machines, while maintaining the same asymptotic complexity. We obtain encouraging improvements to overall convergence for distributed experiments on real datasets with up to billions of examples and features.

Submitted 20 August, 2015; originally announced August 2015.

Comments: 19 pages

arXiv:1506.07677 [pdf, other]

Manifold Optimization for Gaussian Mixture Models

Authors: Reshad Hosseini, Suvrit Sra

Abstract: We take a new look at parameter estimation for Gaussian Mixture Models (GMMs). In particular, we propose using \emph{Riemannian manifold optimization} as a powerful counterpart to Expectation Maximization (EM). An out-of-the-box invocation of manifold optimization, however, fails spectacularly: it converges to the same solution but vastly slower. Driven by intuition from manifold convexity, we then propose a reparameterization that has remarkable empirical consequences. It makes manifold optimization not only match EM---a highly encouraging result in itself given the poor record nonlinear programming methods have had against EM so far---but also outperform EM in many practical settings, while displaying much less variability in running times. We further highlight the strengths of manifold optimization by developing a somewhat tuned manifold LBFGS method that proves even more competitive and reliable than existing manifold optimization tools. We hope that our results encourage a wider consideration of manifold optimization for parameter estimation problems.

Submitted 25 June, 2015; originally announced June 2015.

Comments: 19 pages

arXiv:1506.06840 [pdf, other]

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola

Abstract: We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, and SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms---a crucial requirement for modern large-scale applications---have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near-linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.

Submitted 24 January, 2016; v1 submitted 22 June, 2015; originally announced June 2015.

arXiv:1411.0589 [pdf, other]

Modular proximal optimization for multidimensional total-variation regularization

Authors: Álvaro Barbero, Suvrit Sra

Abstract: We study \emph{TV regularization}, a widely used technique for eliciting structured sparsity. In particular, we propose efficient algorithms for computing prox-operators for $\ell_p$-norm TV. The most important among these is $\ell_1$-norm TV, for whose prox-operator we present a new geometric analysis which unveils a hitherto unknown connection to taut-string methods. This connection turns out to be remarkably useful as it shows how our geometry-guided implementation results in efficient weighted and unweighted 1D-TV solvers, surpassing state-of-the-art methods. Our 1D-TV solvers provide the backbone for building more complex (two or higher-dimensional) TV solvers within a modular proximal optimization approach. We review the literature for an array of methods exploiting this strategy, and illustrate the benefits of our modular design through an extensive suite of experiments on (i) image denoising, (ii) image deconvolution, (iii) four variants of fused-lasso, and (iv) video denoising. To underscore our claims and permit easy reproducibility, we provide all the reviewed and our new TV solvers in an easy-to-use multi-threaded C++, Matlab and Python library.

Submitted 30 December, 2017; v1 submitted 3 November, 2014; originally announced November 2014.

Comments: 67 pages, 32 figures, new non-iterative fast TV algorithm, extensive new experiments, corresponds to the github proxtv repository now

arXiv:1410.4812 [pdf, other]

doi 10.1016/j.csda.2016.02.009

Inference and Mixture Modeling with the Elliptical Gamma Distribution

Authors: Reshad Hosseini, Suvrit Sra, Lucas Theis, Matthias Bethge

Abstract: We study modeling and inference with the Elliptical Gamma Distribution (EGD). We consider maximum likelihood (ML) estimation for EGD scatter matrices, a task for which we develop new fixed-point algorithms. Our algorithms are efficient and converge to global optima despite nonconvexity. Moreover, they turn out to be much faster than both a well-known iterative algorithm of Kent & Tyler (1991) and sophisticated manifold optimization algorithms. Subsequently, we invoke our ML algorithms as subroutines for estimating parameters of a mixture of EGDs. We illustrate our methods by applying them to model natural image statistics---the proposed EGD mixture model yields the most parsimonious model among several competing approaches.

Submitted 20 December, 2015; v1 submitted 17 October, 2014; originally announced October 2014.

Comments: 23 pages, 11 figures

Journal ref: Computational Statistics & Data Analysis 2016, Vol. 101, 29-43

arXiv:1409.6086 [pdf, other]

Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms

Authors: Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, Eric P. Xing

Abstract: We develop parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with mini-batching, and the latter in a delayed update framework. Whenever possible, we perform computations asynchronously, which helps attain speedups on multicore machines as well as in distributed environments. Moreover, instead of worst-case bounded delays, our methods only depend (mildly) on \emph{expected} delays, allowing them to be robust to stragglers and faulty worker threads. Our algorithms assume block-separable constraints, and subsume the recent Block-Coordinate Frank-Wolfe (BCFW) method~\citep{lacoste2013block}. Our analysis reveals problem-dependent quantities that govern the speedups of our methods over BCFW. We present experiments on structural SVM and Group Fused Lasso, obtaining significant speedups over competing state-of-the-art (and synchronous) methods.

Submitted 12 February, 2016; v1 submitted 22 September, 2014; originally announced September 2014.

arXiv:1409.2617 [pdf, other]

Large-scale randomized-coordinate descent methods with non-separable linear constraints

Authors: Sashank Reddi, Ahmed Hefny, Carlton Downey, Avinava Dubey, Suvrit Sra

Abstract: We develop randomized (block) coordinate descent (CD) methods for linearly constrained convex optimization. Unlike most CD methods, we do not assume the constraints to be separable, but let them be coupled linearly. To our knowledge, ours is the first CD method that allows linear coupling constraints, without making the global iteration complexity have an exponential dependence on the number of constraints. We present algorithms and analysis for four key problem scenarios: (i) smooth; (ii) smooth + nonsmooth separable; (iii) asynchronous parallel; and (iv) stochastic. We illustrate empirical behavior of our algorithms by simulation experiments.

Submitted 10 June, 2015; v1 submitted 9 September, 2014; originally announced September 2014.

Showing 1–50 of 57 results for author: Suvrit