Search | arXiv e-print repository

arXiv:1906.05413 [pdf, other]

Flexible Modeling of Diversity with Strongly Log-Concave Distributions

Authors: Joshua Robinson, Suvrit Sra, Stefanie Jegelka

Abstract: Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.

Submitted 12 June, 2019; originally announced June 2019.

arXiv:1905.12436 [pdf, other]

Acceleration in First Order Quasi-strongly Convex Optimization by ODE Discretization

Authors: Jingzhao Zhang, Suvrit Sra, Ali Jadbabaie

Abstract: We study gradient-based optimization methods obtained by direct Runge-Kutta discretization of the ordinary differential equation (ODE) describing the movement of a heavy-ball under constant friction coefficient. When the function is high order smooth and strongly convex, we show that directly simulating the ODE with known numerical integrators achieves acceleration in a nontrivial neighborhood of the optimal solution. In particular, the neighborhood can grow larger as the condition number of the function increases. Furthermore, our results also hold for nonconvex but quasi-strongly convex objectives. We provide numerical experiments that verify the theoretical rates predicted by our results.

Submitted 28 May, 2019; originally announced May 2019.

Comments: arXiv admin note: text overlap with arXiv:1805.00521

arXiv:1905.11881 [pdf, other]

Why gradient clipping accelerates training: A theoretical justification for adaptivity

Authors: Jingzhao Zhang, Tianxing He, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, \emph{gradient clipping} and \emph{normalized gradient}, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.

Submitted 10 February, 2020; v1 submitted 28 May, 2019; originally announced May 2019.
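
The two update rules named in the abstract above can be illustrated with a minimal sketch. This is a textbook form of gradient clipping and normalized gradient descent, not the paper's exact algorithm or step-size schedule; the function names, the parameters `threshold` and `beta`, and the quartic test objective are assumptions chosen for illustration.

```python
import math

def clipped_step(x, grad, lr, threshold):
    """One gradient step with the gradient norm clipped to `threshold` (illustrative)."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Rescale only when the gradient is longer than the threshold.
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [xi - lr * scale * gi for xi, gi in zip(x, grad)]

def normalized_step(x, grad, lr, beta=1.0):
    """One normalized-gradient step with effective step size lr / (norm + beta)."""
    norm = math.sqrt(sum(g * g for g in grad))
    return [xi - lr * gi / (norm + beta) for xi, gi in zip(x, grad)]

def grad_quartic(x):
    # Gradient of f(x) = sum(x_i^4) / 4; its curvature 3 * x_i^2 is unbounded,
    # so no single Lipschitz-smoothness constant holds globally.
    return [xi ** 3 for xi in x]

# Hypothetical usage: 200 clipped steps from a point with a large gradient.
x = [3.0, -4.0]
for _ in range(200):
    x = clipped_step(x, grad_quartic(x), lr=0.1, threshold=1.0)
```

Far from the minimizer both rules move a bounded distance per step regardless of how large the gradient is, which is why they tolerate objectives whose smoothness grows with the gradient norm.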

arXiv:1901.09149 [pdf, other]

Escaping Saddle Points with Adaptive Gradient Methods

Authors: Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

Abstract: Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.

Submitted 3 February, 2020; v1 submitted 25 January, 2019; originally announced January 2019.

Comments: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. matrix Freedman

arXiv:1812.03190 [pdf, other]

Deep-RBF Networks Revisited: Robust Classification with Rejection

Authors: Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra

Abstract: One of the main drawbacks of deep neural networks, like many other classifiers, is their vulnerability to adversarial attacks. An important reason for their vulnerability is assigning high confidence to regions with few or even no feature points. By feature points, we mean a nonlinear transformation of the input space extracting a meaningful representation of the input data. On the other hand, deep-RBF networks assign high confidence only to the regions containing enough feature points, but they have been discounted due to the widely-held belief that they have the vanishing gradient problem. In this paper, we revisit the deep-RBF networks by first giving a general formulation for them, and then proposing a family of cost functions thereof inspired by metric learning. In the proposed deep-RBF learning algorithm, the vanishing gradient problem does not occur. We make these networks robust to adversarial attack by adding the reject option to their output layer. Through several experiments on the MNIST dataset, we demonstrate that our proposed method not only achieves significant classification accuracy but is also very resistant to various adversarial attacks.

Submitted 7 December, 2018; originally announced December 2018.

arXiv:1811.04194 [pdf, other]

R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate

Authors: Jingzhao Zhang, Hongyi Zhang, Suvrit Sra

Abstract: We study smooth stochastic optimization problems on Riemannian manifolds. Via adapting the recently proposed SPIDER algorithm \citep{fang2018spider} (a variance reduced stochastic method) to Riemannian manifold, we can achieve faster rate than known algorithms in both the finite sum and stochastic settings. Unlike previous works, by \emph{not} resorting to bounding iterate distances, our analysis yields curvature independent convergence rates for both the nonconvex and strongly convex cases.

Submitted 14 December, 2018; v1 submitted 9 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1605.07147

arXiv:1810.07770 [pdf, ps, other]

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $Ω(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $Θ(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = Ω(N)$. Combined with a recent upper bound $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.

Submitted 29 October, 2019; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: 28 pages, 2 figures. NeurIPS 2019 Camera-ready version

arXiv:1809.10858 [pdf, ps, other]

Efficiently testing local optimality and escaping saddles for ReLU networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We provide a theoretical algorithm for checking local optimality and escaping saddles at nondifferentiable points of empirical risks of two-layer ReLU networks. Our algorithm receives any parameter value and returns: local minimum, second-order stationary point, or a strict descent direction. The presence of $M$ data points on the nondifferentiability of the ReLU divides the parameter space into at most $2^M$ regions, which makes analysis difficult. By exploiting polyhedral geometry, we reduce the total computation down to one convex quadratic program (QP) for each hidden node, $O(M)$ (in)equality tests, and one (or a few) nonconvex QP. For the last QP, we show that our specific problem can be solved efficiently, in spite of nonconvexity. In the benign case, we solve one equality constrained QP, and we prove that projected gradient descent solves it exponentially fast. In the bad case, we have to solve a few more inequality constrained QPs, but we prove that the time complexity is exponential only in the number of inequality constraints. Our experiments show that either benign case or bad case with very few inequality constraints occurs, implying that our algorithm is efficient in most cases.

Submitted 28 May, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

Comments: 23 pages, appeared at ICLR 2019

arXiv:1806.10077 [pdf, other]

Random Shuffling Beats SGD after Finite Epochs

Authors: Jeff Z. HaoChen, Suvrit Sra

Abstract: A long-standing problem in the theory of stochastic gradient descent (SGD) is to prove that its without-replacement version RandomShuffle converges faster than the usual with-replacement version. We present the first (to our knowledge) non-asymptotic solution to this problem, which shows that after a "reasonable" number of epochs RandomShuffle indeed converges faster than SGD. Specifically, we prove that under strong convexity and second-order smoothness, the sequence generated by RandomShuffle converges to the optimal solution at the rate O(1/T^2 + n^3/T^3), where n is the number of components in the objective, and T is the total number of iterations. This result shows that after a reasonable number of epochs RandomShuffle is strictly better than SGD (which converges as O(1/T)). The key step toward showing this better dependence on T is the introduction of n into the bound; and as our analysis will show, in general a dependence on n is unavoidable without further changes to the algorithm. We show that for sparse data RandomShuffle has the rate O(1/T^2), again strictly better than SGD. Furthermore, we discuss extensions to nonconvex gradient dominated functions, as well as non-strongly convex settings.

Submitted 7 October, 2019; v1 submitted 26 June, 2018; originally announced June 2018.
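
The difference between the two sampling schemes compared in the abstract above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's analysis or experimental setup: the function names, the one-dimensional least-squares objective, and the 1/t step size are all hypothetical choices.

```python
import random

def sgd_indices(n, epochs, rng):
    """With-replacement sampling: every iteration draws a component uniformly."""
    return [rng.randrange(n) for _ in range(n * epochs)]

def random_shuffle_indices(n, epochs, rng):
    """Without-replacement sampling: a fresh permutation of 0..n-1 in each epoch."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order

def run(indices, targets, lr0=1.0):
    """Minimize (1/n) * sum_i 0.5 * (x - a_i)^2 using step size lr0 / t."""
    x = 0.0
    for t, i in enumerate(indices, start=1):
        x -= (lr0 / t) * (x - targets[i])  # gradient of component i is (x - a_i)
    return x

# Hypothetical usage on five targets whose mean (the minimizer) is 3.0.
rng = random.Random(0)
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
x_shuffle = run(random_shuffle_indices(len(targets), 4, rng), targets)
x_sgd = run(sgd_indices(len(targets), 4, rng), targets)
```

With the step size 1/t used here, the iterate is the running average of the targets visited so far. RandomShuffle therefore lands on the exact mean after any whole number of epochs, whereas with-replacement sampling generally returns only a noisy estimate of it.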

arXiv:1806.02812 [pdf, other]

Towards Riemannian Accelerated Gradient Methods

Authors: Hongyi Zhang, Suvrit Sra

Abstract: We propose a Riemannian version of Nesterov's Accelerated Gradient algorithm (RAGD), and show that for geodesically smooth and strongly convex problems, within a neighborhood of the minimizer whose radius depends on the condition number as well as the sectional curvature of the manifold, RAGD converges to the minimizer with acceleration. Unlike the algorithm in (Liu et al., 2017) that requires the exact solution to a nonlinear equation which in turn may be intractable, our algorithm is constructive and computationally tractable. Our proof exploits a new estimate sequence and a novel bound on the nonlinear metric distortion, both ideas may be of independent interest.

Submitted 7 June, 2018; originally announced June 2018.

Comments: Published in 31st Annual Conference on Learning Theory (COLT'18)

arXiv:1805.00521 [pdf, other]

Direct Runge-Kutta Discretization Achieves Acceleration

Authors: Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, Ali Jadbabaie

Abstract: We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

Submitted 27 November, 2018; v1 submitted 1 May, 2018; originally announced May 2018.

Comments: 24 pages. 4 figures

arXiv:1803.11064 [pdf, other]

Non-Linear Temporal Subspace Representations for Activity Recognition

Authors: Anoop Cherian, Suvrit Sra, Stephen Gould, Richard Hartley

Abstract: Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characterized by their variations in time. As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of data onto which captures their temporal order. We develop this idea further and show that such a pooling scheme can be cast as an order-constrained kernelized PCA objective. We then propose to use the parameters of a kernelized low-rank feature subspace as the representation of the sequences. We cast our formulation as an optimization problem on generalized Grassmann manifolds and then solve it efficiently using Riemannian optimization techniques. We present experiments on several action recognition datasets using diverse feature modalities and demonstrate state-of-the-art results.

Submitted 27 March, 2018; originally announced March 2018.

Comments: Accepted at the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, 2018. arXiv admin note: substantial text overlap with arXiv:1705.08583

arXiv:1803.10141 [pdf, other]

New concavity and convexity results for symmetric polynomials and their ratios

Authors: Suvrit Sra

Abstract: We prove some "power" generalizations of Marcus-Lopes-style (including McLeod and Bullen) concavity inequalities for elementary symmetric polynomials, and convexity inequalities (of McLeod and Baston) for complete homogeneous symmetric polynomials. Finally, we present sundry concavity results for elementary symmetric polynomials, of which the main result is a concavity theorem that among others implies a well-known log-convexity result of Muir (1972/74) for positive definite matrices.

Submitted 27 March, 2018; originally announced March 2018.

Comments: 6 pages
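
For readers unfamiliar with the two polynomial families named in the abstract above, the following sketch computes them numerically. It uses the standard dynamic-programming recurrences for elementary and complete homogeneous symmetric polynomials; the function names are hypothetical, and nothing here reproduces the paper's inequalities.

```python
def elementary_symmetric(values):
    """Return [e_0, e_1, ..., e_n], where e_k sums all products of k distinct entries."""
    n = len(values)
    e = [1.0] + [0.0] * n
    for v in values:
        # Descending k so that each entry is used at most once per product.
        for k in range(n, 0, -1):
            e[k] += v * e[k - 1]
    return e

def complete_homogeneous(values, degree):
    """Return [h_0, ..., h_degree], where h_k sums all degree-k monomials (repeats allowed)."""
    h = [1.0] + [0.0] * degree
    for v in values:
        # Ascending k so that the same entry may be reused within a monomial.
        for k in range(1, degree + 1):
            h[k] += v * h[k - 1]
    return h

# Hypothetical usage on the values 1, 2, 3.
e = elementary_symmetric([1.0, 2.0, 3.0])
h = complete_homogeneous([1.0, 2.0, 3.0], 2)
```

Applied to the eigenvalues of a matrix, `e[1]` is the trace and `e[n]` is the determinant, which is the sense in which these polynomials connect to results about positive definite matrices.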

arXiv:1802.05649 [pdf, other]

Learning Determinantal Point Processes by Corrective Negative Sampling

Authors: Zelda Mariet, Mike Gartrell, Suvrit Sra

Abstract: Determinantal Point Processes (DPPs) have attracted significant interest from the machine-learning community due to their ability to elegantly and tractably model the delicate balance between quality and diversity of sets. DPPs are commonly learned from data using maximum likelihood estimation (MLE). While fitting observed sets well, MLE for DPPs may also assign high likelihoods to unobserved sets that are far from the true generative distribution of the data. To address this issue, which reduces the quality of the learned model, we introduce a novel optimization problem, Contrastive Estimation (CE), which encodes information about "negative" samples into the basic learning model. CE is grounded in the successful use of negative information in machine-vision and language modeling. Depending on the chosen negative distribution (which may be static or evolve during optimization), CE assumes two different forms, which we analyze theoretically and experimentally. We evaluate our new model on real-world datasets; on a challenging dataset, CE learning delivers a considerable improvement in predictive performance over a DPP learned without using contrastive information.

Submitted 26 February, 2019; v1 submitted 15 February, 2018; originally announced February 2018.

Comments: Will appear in AISTATS 2019

arXiv:1802.03487 [pdf, ps, other]

Small nonlinearities in activation functions create bad local minima in neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with "slightest" nonlinearity, the empirical risks have spurious local minima in most cases. Our results thus indicate that in general "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not be robust. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.), for which there exists a bad local minimum. Our results make the least restrictive assumptions relative to existing results on spurious local optima in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.

Submitted 28 May, 2019; v1 submitted 9 February, 2018; originally announced February 2018.

Comments: 33 pages, appeared at ICLR 2019

arXiv:1710.10770 [pdf, other]

Riemannian Optimization via Frank-Wolfe Methods

Authors: Melanie Weber, Suvrit Sra

Abstract: We study projection-free methods for constrained Riemannian optimization. In particular, we propose the Riemannian Frank-Wolfe (RFW) method. We analyze non-asymptotic convergence rates of RFW to an optimum for (geodesically) convex problems, and to a critical point for nonconvex objectives. We also present a practical setting under which RFW can attain a linear convergence rate. As a concrete example, we specialize RFW to the manifold of positive definite matrices and apply it to two tasks: (i) computing the matrix geometric mean (Riemannian centroid); and (ii) computing the Bures-Wasserstein barycenter. Both tasks involve geodesically convex interval constraints, for which we show that the Riemannian "linear" oracle required by RFW admits a closed-form solution; this result may be of independent interest. We further specialize RFW to the special orthogonal group and show that here too, the Riemannian "linear" oracle can be solved in closed form. Here, we describe an application to the synchronization of data matrices (Procrustes problem). We complement our theoretical results with an empirical comparison of RFW against state-of-the-art Riemannian optimization methods and observe that RFW performs competitively on the task of computing Riemannian centroids.

Submitted 24 November, 2021; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: Under Review. Updated version with new section on approximately solving the RLO

MSC Class: 46N10; 15A24; 65K10; 49Q99

arXiv:1709.01434 [pdf, other]

A Generic Approach for Escaping Saddle points

Authors: Sashank J Reddi, Manzil Zaheer, Suvrit Sra, Barnabas Poczos, Francis Bach, Ruslan Salakhutdinov, Alexander J Smola

Abstract: A central challenge to using first-order methods for optimizing nonconvex problems is the presence of saddle points. First-order methods often get stuck at saddle points, greatly deteriorating their performance. Typically, to escape from saddles one has to use second-order methods. However, most works on second-order methods rely extensively on expensive Hessian-based computations, making them impractical in large-scale settings. To tackle this challenge, we introduce a generic framework that minimizes Hessian based computations while at the same time provably converging to second-order critical points. Our framework carefully alternates between a first-order and a second-order subroutine, using the latter only close to saddle points, and yields convergence results competitive to the state-of-the-art. Empirical results suggest that our strategy also enjoys a good practical performance.

Submitted 5 September, 2017; originally announced September 2017.

arXiv:1707.02444 [pdf, ps, other]

Global optimality conditions for deep neural networks

Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie

Abstract: We study the error landscape of deep linear and nonlinear neural networks with the squared error loss. Minimizing the loss of a deep linear neural network is a nonconvex problem, and despite recent progress, our understanding of this loss surface is still incomplete. For deep linear networks, we present necessary and sufficient conditions for a critical point of the risk function to be a global minimum. Surprisingly, our conditions provide an efficiently checkable test for global optimality, while such tests are typically intractable in nonconvex optimization. We further extend these results to deep nonlinear neural networks and prove similar sufficient conditions for global optimality, albeit in a more limited function space setting.

Submitted 24 March, 2018; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: 14 pages. A camera-ready version that will appear at ICLR 2018

arXiv:1706.09549 [pdf, other]

Distributional Adversarial Networks

Authors: Chengtao Li, David Alvarez-Melis, Keyulu Xu, Stefanie Jegelka, Suvrit Sra

Abstract: We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination. Inspired by discrepancy measures and two-sample tests between probability distributions, we propose two such distributional adversaries that operate and predict on samples, and show how they can be easily implemented on top of existing models. Various experimental results show that generators trained with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with pointwise prediction discriminators. The application of our framework to domain adaptation also results in considerable improvement over recent state-of-the-art.

Submitted 9 July, 2017; v1 submitted 28 June, 2017; originally announced June 2017.

arXiv:1706.03267 [pdf, other]

An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

Authors: Reshad Hosseini, Suvrit Sra

Abstract: We consider maximum likelihood estimation for Gaussian Mixture Models (GMMs). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast GMM parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry, and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods.

Submitted 10 June, 2017; originally announced June 2017.

Comments: 21 pages, 6 figures

arXiv:1705.09677 [pdf, ps, other]

Elementary Symmetric Polynomials for Optimal Experimental Design

Authors: Zelda Mariet, Suvrit Sra

Abstract: We revisit the classical problem of optimal experimental design (OED) under a new mathematical model grounded in a geometric motivation. Specifically, we introduce models based on elementary symmetric polynomials; these polynomials capture "partial volumes" and offer a graded interpolation between the widely used A-optimal design and D-optimal design models, obtaining each of them as special cases. We analyze properties of our models, and derive both greedy and convex-relaxation algorithms for computing the associated designs. Our analysis establishes approximation guarantees on these algorithms, while our empirical results substantiate our claims and demonstrate a curious phenomenon concerning our greedy method. Finally, as a byproduct, we obtain new results on the theory of elementary symmetric polynomials that may be of independent interest.

Submitted 24 May, 2017; originally announced May 2017.

arXiv:1705.08583 [pdf, other]

Sequence Summarization Using Order-constrained Kernelized Feature Subspaces

Authors: Anoop Cherian, Suvrit Sra, Richard Hartley

Abstract: Representations that can compactly and effectively capture temporal evolution of semantic content are important to machine learning algorithms that operate on multivariate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characterized by their variations in time. As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in an RKHS, projections of data onto which capture their temporal order. We develop this idea further and show that such a pooling scheme can be cast as an order-constrained kernelized PCA objective; we then propose to use the parameters of a kernelized low-rank feature subspace as the representation of the sequences. We cast our formulation as an optimization problem on generalized Grassmann manifolds and then solve it efficiently using Riemannian optimization techniques. We present experiments on several action recognition datasets using diverse feature modalities and demonstrate state-of-the-art results.

Submitted 23 May, 2017; originally announced May 2017.

arXiv:1703.02674 [pdf, other]

Polynomial Time Algorithms for Dual Volume Sampling

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: We study dual volume sampling, a method for selecting k columns from an n x m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.

Submitted 15 November, 2017; v1 submitted 7 March, 2017; originally announced March 2017.

arXiv:1608.01008 [pdf, other]

Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: We study probability measures induced by set functions with constraints. Such measures arise in a variety of real-world settings, where prior knowledge, resource limitations, or other pragmatic considerations impose constraints. We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them. Our first main result is for MCMC sampling from Strongly Rayleigh (SR) measures, for which we present sharp polynomial bounds on the mixing time. As a corollary, this result yields a fast mixing sampler for Determinantal Point Processes (DPPs), yielding (to our knowledge) the first provably fast MCMC sampler for DPPs since their inception over four decades ago. Beyond SR measures, we develop MCMC samplers for probabilistic models with hard constraints and identify sufficient conditions under which their chains mix rapidly. We illustrate our claims by empirically verifying the dependence of mixing times on the key factors governing our theoretical bounds.

Submitted 8 January, 2017; v1 submitted 2 August, 2016; originally announced August 2016.

Comments: The present version subsumes arXiv:1607.03559

arXiv:1607.08254 [pdf, other]

Stochastic Frank-Wolfe Methods for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For objective functions that decompose into a finite-sum, we leverage ideas from variance reduction techniques for convex optimization to obtain new variance reduced nonconvex Frank-Wolfe methods that have provably faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.

Submitted 29 July, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

arXiv:1607.05002 [pdf, ps, other]

Geometric Mean Metric Learning

Authors: Pourya Habib Zadeh, Reshad Hosseini, Suvrit Sra

Abstract: We revisit the task of learning a Euclidean metric from data. We approach this problem from first principles and formulate it as a surprisingly simple optimization problem. Indeed, our formulation even admits a closed form solution. This solution possesses several very attractive properties: (i) an innate geometric appeal through the Riemannian geometry of positive definite matrices; (ii) ease of interpretability; and (iii) computational speed several orders of magnitude faster than the widely used LMNN and ITML methods. Furthermore, on standard benchmark datasets, our closed-form solution consistently attains higher classification accuracy.

Submitted 18 July, 2016; originally announced July 2016.

Comments: 7 pages, 4 figures

arXiv:1607.03559 [pdf, other]

Fast Sampling for Strongly Rayleigh Measures with Application to Determinantal Point Processes

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: In this note we consider sampling from (non-homogeneous) strongly Rayleigh probability measures. As an important corollary, we obtain a fast mixing Markov Chain sampler for Determinantal Point Processes.

Submitted 12 July, 2016; originally announced July 2016.

arXiv:1605.08374 [pdf, other]

Kronecker Determinantal Point Processes

Authors: Zelda Mariet, Suvrit Sra

Abstract: Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of $N$ items. They have recently gained prominence in several applications that rely on "diverse" subsets. However, their applicability to large problems is still limited due to the $\mathcal O(N^3)$ complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDPP, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDPP.

Submitted 26 May, 2016; originally announced May 2016.

arXiv:1605.07147 [pdf, other]

Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds

Authors: Hongyi Zhang, Sashank J. Reddi, Suvrit Sra

Abstract: We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finite-sums have witnessed tremendous attention in recent years, existing work is limited to vector space problems. We introduce Riemannian SVRG (RSVRG), a new variance reduced Riemannian optimization method. We analyze RSVRG for both geodesically convex and nonconvex (smooth) functions. Our analysis reveals that RSVRG inherits advantages of the usual SVRG method, but with factors depending on curvature of the manifold that influence its convergence. To our knowledge, RSVRG is the first provably fast stochastic Riemannian method. Moreover, our paper presents the first non-asymptotic complexity analysis (novel even for the batch setting) for nonconvex Riemannian optimization. Our results have several implications; for instance, they offer a Riemannian perspective on variance reduced PCA, which promises a short, transparent convergence analysis.

Submitted 7 April, 2017; v1 submitted 23 May, 2016; originally announced May 2016.

Comments: This is the final version that appeared in NIPS 2016. Our proof of Lemma 2 was incorrect in the previous arXiv version. (9 pages paper + 6 pages appendix)

Journal ref: Advances in Neural Information Processing Systems 29 (NIPS 2016)

arXiv:1605.06900 [pdf, other]

Fast Stochastic Methods for Nonsmooth Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we show provably faster convergence than batch proximal gradient descent. Finally, we prove global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions, that subsumes several recent works. This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.

Submitted 23 May, 2016; originally announced May 2016.

arXiv:1605.00316 [pdf, other]

Directional Statistics in Machine Learning: a Brief Review

Authors: Suvrit Sra

Abstract: The modern data analyst must cope with data encoded in various forms: vectors, matrices, strings, graphs, or more. Consequently, statistical and machine learning models tailored to different data encodings are important. We focus on data encoded as normalized vectors, so that their "direction" is more important than their magnitude. Specifically, we consider high-dimensional vectors that lie either on the surface of the unit hypersphere or on the real projective plane. For such data, we briefly review common mathematical models prevalent in machine learning, while also outlining some technical aspects, software, applications, and open mathematical challenges.

Submitted 1 May, 2016; originally announced May 2016.

Comments: 12 pages, slightly modified version of submitted book chapter

arXiv:1604.02027 [pdf, other]

Combinatorial Topic Models using Small-Variance Asymptotics

Authors: Ke Jiang, Suvrit Sra, Brian Kulis

Abstract: Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and propose a new objective function derived from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are competitive with popular LDA-based topic modeling approaches, and also discuss the (dis)similarities between our approach and its probabilistic counterparts.

Submitted 26 May, 2016; v1 submitted 7 April, 2016; originally announced April 2016.

Comments: 19 pages

arXiv:1603.06160 [pdf, other]

Stochastic Variance Reduction for Nonconvex Optimization

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.

Submitted 4 April, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

Comments: Minor feedback changes

arXiv:1603.06159 [pdf, other]

Fast Incremental Method for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss Polyak's special class of nonconvex problems for which SAGA converges at a linear rate to the global optimum. Finally, we analyze the practically valuable regularized and minibatch variants of SAGA. To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems.

Submitted 19 March, 2016; originally announced March 2016.

arXiv:1603.06052 [pdf, other]

Fast DPP Sampling for Nyström with Application to Kernel Methods

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: The Nyström method has long been popular for scaling up kernel methods. Its theoretical guarantees and empirical performance rely critically on the quality of the landmarks selected. We study landmark selection for Nyström using Determinantal Point Processes (DPPs), discrete probability models that allow tractable generation of diverse samples. We prove that landmarks selected via DPPs guarantee bounds on approximation errors; subsequently, we analyze implications for kernel ridge regression. Contrary to prior reservations due to cubic complexity of DPP sampling, we show that (under certain conditions) Markov chain DPP sampling requires only linear time in the size of the data. We present several empirical results that support our theoretical analysis, and demonstrate the superior performance of DPP-based landmark selection compared with existing approaches.

Submitted 28 May, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

arXiv:1602.06053 [pdf, other]

First-order Methods for Geodesically Convex Optimization

Authors: Hongyi Zhang, Suvrit Sra

Abstract: Geodesic convexity generalizes the notion of (vector space) convexity to nonlinear metric spaces. But unlike convex optimization, geodesically convex (g-convex) optimization is much less developed. In this paper we contribute to the understanding of g-convex optimization by developing iteration complexity analysis for several first-order algorithms on Hadamard manifolds. Specifically, we prove upper bounds for the global complexity of deterministic and stochastic (sub)gradient methods for optimizing smooth and nonsmooth g-convex functions, both with and without strong g-convexity. Our analysis also reveals how the manifold geometry, especially sectional curvature, impacts convergence rates. To the best of our knowledge, our work is the first to provide global complexity analysis for first-order algorithms for general g-convex optimization.

Submitted 19 February, 2016; originally announced February 2016.

Comments: 21 pages

arXiv:1512.01904 [pdf, other]

Gauss quadrature for matrix inverse forms with applications

Authors: Chengtao Li, Suvrit Sra, Stefanie Jegelka

Abstract: We present a framework for accelerating a spectrum of machine learning algorithms that require computation of bilinear inverse forms $u^\top A^{-1}u$, where $A$ is a positive definite matrix and $u$ a given vector. Our framework is built on Gauss-type quadrature and easily scales to large, sparse matrices. Further, it allows retrospective computation of lower and upper bounds on $u^\top A^{-1}u$, which in turn accelerates several algorithms. We prove that these bounds tighten iteratively and converge at a linear (geometric) rate. To our knowledge, ours is the first work to demonstrate these key properties of Gauss-type quadrature, which is a classical and deeply studied topic. We illustrate empirical consequences of our results by using quadrature to accelerate machine learning tasks involving determinantal point processes and submodular optimization, and observe tremendous speedups in several instances.

Submitted 28 May, 2016; v1 submitted 6 December, 2015; originally announced December 2015.

arXiv:1511.05077 [pdf, other]

Diversity Networks: Neural Network Compression Using Determinantal Point Processes

Authors: Zelda Mariet, Suvrit Sra

Abstract: We introduce Divnet, a flexible technique for learning networks with diverse neurons. Divnet models neuronal diversity by placing a Determinantal Point Process (DPP) over neurons in a given layer. It uses this DPP to select a subset of diverse neurons and subsequently fuses the redundant neurons into the selected ones. Compared with previous approaches, Divnet offers a more principled, flexible technique for capturing neuronal diversity and thus implicitly enforcing regularization. This enables effective auto-tuning of network architecture and leads to smaller network sizes without hurting performance. Moreover, through its focus on diversity and neuron fusing, Divnet remains compatible with other procedures that seek to reduce memory footprints of networks. We present experimental results to corroborate our claims: for pruning neural networks, Divnet is seen to be notably superior to competing approaches.

Submitted 18 April, 2017; v1 submitted 16 November, 2015; originally announced November 2015.

Comments: This paper appeared under the shorter title Diversity Networks at ICLR 2016 (http://www.iclr.cc/doku.php?id=iclr2016:main#accepted_papers_conference_track)

arXiv:1509.05902 [pdf, other]

Logarithmic inequalities under an elementary symmetric polynomial dominance order

Authors: Suvrit Sra

Abstract: We consider a dominance order on positive vectors induced by the elementary symmetric polynomials. Under this dominance order we provide conditions that yield simple proofs of several monotonicity questions. Notably, our approach yields a quick (4 line) proof of the so-called "sum-of-squared-logarithms" inequality conjectured in (P. Neff, B. Eidel, F. Osterbrink, and R. Martin, Applied Math. & Mechanics, 2013; P. Neff, Y. Nakatsukasa, and A. Fischle, SIMAX, 35, 2014). This inequality has been the subject of several recent articles, and only recently it received a full proof, albeit via a more elaborate complex-analytic approach. We provide an elementary proof, which moreover extends to yield simple proofs of both old and new inequalities for Rényi entropy, subentropy, and quantum Rényi entropy.

Submitted 22 June, 2017; v1 submitted 19 September, 2015; originally announced September 2015.

Comments: 6 pages; updated typesetting, some minor bugfixes

arXiv:1509.02447 [pdf, other]

Efficient Structured Matrix Rank Minimization

Authors: Adams Wei Yu, Wanli Ma, Yaoliang Yu, Jaime G. Carbonell, Suvrit Sra

Abstract: We study the problem of finding structured low-rank matrices using nuclear norm regularization where the structure is encoded by a linear map. In contrast to most known approaches for linearly structured rank minimization, we do not (a) use the full SVD, nor (b) resort to augmented Lagrangian techniques, nor (c) solve linear systems per iteration. Instead, we formulate the problem differently so that it is amenable to a generalized conditional gradient method, which results in a practical improvement with low per iteration computational cost. Numerical results show that our approach significantly outperforms state-of-the-art competitors in terms of running time, while effectively recovering low rank solutions in stochastic system realization and spectral compressed sensing problems.

Submitted 8 September, 2015; originally announced September 2015.

arXiv:1509.01618 [pdf, other]

Efficient Sampling for k-Determinantal Point Processes

Authors: Chengtao Li, Stefanie Jegelka, Suvrit Sra

Abstract: Determinantal Point Processes (DPPs) are elegant probabilistic models of repulsion and diversity over discrete sets of items. But their applicability to large sets is hindered by expensive cubic-complexity matrix operations for basic tasks such as sampling. In light of this, we propose a new method for approximate sampling from discrete $k$-DPPs. Our method takes advantage of the diversity property of subsets sampled from a DPP, and proceeds in two stages: first it constructs coresets for the ground set of items; thereafter, it efficiently samples subsets based on the constructed coresets. As opposed to previous approaches, our algorithm aims to minimize the total variation distance to the original distribution. Experiments on both synthetic and real datasets indicate that our sampling algorithm works efficiently on large data sets, and yields more accurate samples than previous approaches.

Submitted 27 May, 2016; v1 submitted 4 September, 2015; originally announced September 2015.

arXiv:1508.05003 [pdf, other]

AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization

Authors: Suvrit Sra, Adams Wei Yu, Mu Li, Alexander J. Smola

Abstract: We study distributed stochastic convex optimization under the delayed gradient model where the server nodes perform parameter updates, while the worker nodes compute stochastic gradients. We discuss, analyze, and experiment with a setup motivated by the behavior of real-world distributed computation networks, where the machines are differently slow at different times. Therefore, we allow the parameter updates to be sensitive to the actual delays experienced, rather than to worst-case bounds on the maximum delay. This sensitivity leads to larger stepsizes that can help gain rapid initial convergence without having to wait too long for slower machines, while maintaining the same asymptotic complexity. We obtain encouraging improvements to overall convergence for distributed experiments on real datasets with up to billions of examples and features.

Submitted 20 August, 2015; originally announced August 2015.

Comments: 19 pages

arXiv:1508.04039 [pdf, ps, other]

The sum of squared logarithms inequality in arbitrary dimensions

Authors: Lev Borisov, Patrizio Neff, Suvrit Sra, Christian Thiel

Abstract: We prove the \emph{sum of squared logarithms inequality} (SSLI) which states that for nonnegative vectors $x, y \in \mathbb{R}^n$ whose elementary symmetric polynomials satisfy $e_k(x)\le e_k(y)$ (for $1\le k < n$) and $e_n(x)=e_n(y)$, the inequality $\sum_i (\log x_i)^2 \le \sum_i (\log y_i)^2$ holds. Our proof of this inequality follows by a suitable extension to the complex plane. In particular, we show that the function $f\colon M\subseteq \mathbb{C}^n\to \mathbb{R}$ with $f(z)=\sum_i(\log z_i)^2$ has nonnegative partial derivatives with respect to the elementary symmetric polynomials of $z$. This property leads to our proof. We conclude by providing applications and wider connections of the SSLI.

Submitted 2 November, 2015; v1 submitted 17 August, 2015; originally announced August 2015.

MSC Class: 26D05; 26D07; 30C15; 97H20

arXiv:1508.00792 [pdf, other]

Fixed-point algorithms for learning determinantal point processes

Authors: Zelda Mariet, Suvrit Sra

Abstract: Determinantal point processes (DPPs) offer an elegant tool for encoding probabilities over subsets of a ground set. Discrete DPPs are parametrized by a positive semidefinite matrix (called the DPP kernel), and estimating this kernel is key to learning DPPs from observed data. We consider the task of learning the DPP kernel, and develop for it a surprisingly simple yet effective new algorithm. Our algorithm offers the following benefits over previous approaches: (a) it is much simpler; (b) it yields equally good and sometimes even better local maxima; and (c) it runs an order of magnitude faster on large problems. We present experimental results on both real and simulated data to illustrate the numerical performance of our technique.

Submitted 8 October, 2015; v1 submitted 4 August, 2015; originally announced August 2015.

Comments: ICML, 2015

arXiv:1507.08366 [pdf, other]

On the matrix square root via geometric optimization

Authors: Suvrit Sra

Abstract: This paper is triggered by the preprint "\emph{Computing Matrix Squareroot via Non Convex Local Search}" by Jain et al. (arXiv:1507.05854), which analyzes gradient-descent for computing the square root of a positive definite matrix. Contrary to claims of Jain et al., our experiments reveal that Newton-like methods compute matrix square roots rapidly and reliably, even for highly ill-conditioned matrices and without requiring commutativity. We observe that gradient-descent converges very slowly primarily due to tiny step-sizes and ill-conditioning. We derive an alternative first-order method based on geodesic convexity: our method admits a transparent convergence analysis ($< 1$ page), attains linear rate, and displays reliable convergence even for rank deficient problems. Though superior to gradient-descent, ultimately our method is also outperformed by a well-known scaled Newton method. Nevertheless, the primary value of our work is its conceptual value: it shows that for deriving gradient based methods for the matrix square root, \emph{the manifold geometric view of positive definite matrices can be much more advantageous than the Euclidean view}.

Submitted 16 December, 2015; v1 submitted 29 July, 2015; originally announced July 2015.

Comments: 8 pages, 12 plots, this version contains several more references and more words about the rank-deficient case

arXiv:1507.02772 [pdf, ps, other]

Riemannian Dictionary Learning and Sparse Coding for Positive Definite Matrices

Authors: Anoop Cherian, Suvrit Sra

Abstract: Data encoded as symmetric positive definite (SPD) matrices frequently arise in many areas of computer vision and machine learning. While these matrices form an open subset of the Euclidean space of symmetric matrices, viewing them through the lens of non-Euclidean Riemannian geometry often turns out to be better suited to capturing several desirable data properties. However, formulating classical machine learning algorithms within such a geometry is often non-trivial and computationally expensive. Inspired by the great success of dictionary learning and sparse coding for vector-valued data, our goal in this paper is to represent data in the form of SPD matrices as sparse conic combinations of SPD atoms from a learned dictionary via a Riemannian geometric approach. To that end, we formulate a novel Riemannian optimization objective for dictionary learning and sparse coding in which the representation loss is characterized via the affine invariant Riemannian metric. We also present a computationally simple algorithm for optimizing our model. Experiments on several computer vision datasets demonstrate superior classification and retrieval performance using our approach when compared to sparse coding via alternative non-Riemannian formulations.

Submitted 16 December, 2015; v1 submitted 9 July, 2015; originally announced July 2015.

arXiv:1506.07677 [pdf, other]

Manifold Optimization for Gaussian Mixture Models

Authors: Reshad Hosseini, Suvrit Sra

Abstract: We take a new look at parameter estimation for Gaussian Mixture Models (GMMs). In particular, we propose using \emph{Riemannian manifold optimization} as a powerful counterpart to Expectation Maximization (EM). An out-of-the-box invocation of manifold optimization, however, fails spectacularly: it converges to the same solution but vastly slower. Driven by intuition from manifold convexity, we then propose a reparameterization that has remarkable empirical consequences. It makes manifold optimization not only match EM---a highly encouraging result in itself given the poor record nonlinear programming methods have had against EM so far---but also outperform EM in many practical settings, while displaying much less variability in running times. We further highlight the strengths of manifold optimization by developing a somewhat tuned manifold LBFGS method that proves even more competitive and reliable than existing manifold optimization tools. We hope that our results encourage a wider consideration of manifold optimization for parameter estimation problems.

Submitted 25 June, 2015; originally announced June 2015.

Comments: 19 pages

arXiv:1506.06840 [pdf, other]

On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, Alex Smola

Abstract: We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). Remarkable recent progress has been made in this direction through development of algorithms like SAG, SVRG, SAGA. These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms---a crucial requirement for modern large-scale applications---have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.

Submitted 24 January, 2016; v1 submitted 22 June, 2015; originally announced June 2015.

arXiv:1503.01563 [pdf, other]

Convex Optimization for Parallel Energy Minimization

Authors: K. S. Sesh Kumar, Alvaro Barbero, Stefanie Jegelka, Suvrit Sra, Francis Bach

Abstract: Energy minimization has been an intensely studied core problem in computer vision. With growing image sizes (2D and 3D), it is now highly desirable to run energy minimization algorithms in parallel. But many existing algorithms, in particular, some efficient combinatorial algorithms, are difficult to parallelize. By exploiting results from convex and submodular theory, we reformulate the quadratic energy minimization problem as a total variation denoising problem, which, when viewed geometrically, enables the use of projection and reflection based convex methods. The resulting min-cut algorithm (and code) is conceptually very simple, and solves a sequence of TV denoising problems. We perform an extensive empirical evaluation comparing state-of-the-art combinatorial algorithms and convex optimization techniques. On small problems the iterative convex methods match the combinatorial max-flow algorithms, while on larger problems they offer other flexibility and important gains: (a) their memory footprint is small; (b) their straightforward parallelizability fits multi-core platforms; (c) they can easily be warm-started; and (d) they quickly reach approximately good solutions, thereby enabling faster "inexact" solutions. A key consequence of our approach based on submodularity and convexity is that it allows one to combine arbitrary combinatorial or convex methods as subroutines, yielding hybrid combinatorial and convex optimization algorithms that benefit from the strengths of both.

Submitted 5 March, 2015; originally announced March 2015.

arXiv:1502.04753 [pdf, other]

doi 10.1016/j.ejc.2015.07.005

On inequalities for normalized Schur functions

Authors: Suvrit Sra

Abstract: We prove a conjecture of Cuttler et al.~[2011] [A. Cuttler, C. Greene, and M. Skandera; \emph{Inequalities for symmetric means}. European J. Combinatorics, 32(2011), 745--761] on the monotonicity of \emph{normalized Schur functions} under the usual (dominance) partial-order on partitions. We believe that our proof technique may be helpful in obtaining similar inequalities for other symmetric functions.

Submitted 20 July, 2015; v1 submitted 16 February, 2015; originally announced February 2015.

Comments: This version fixes the error of the previous one

Showing 51–100 of 119 results for author: Sra, S