-
Towards Quantifying the Preconditioning Effect of Adam
Authors:
Rudrajit Das,
Naman Agarwal,
Sujay Sanghavi,
Inderjit S. Dhillon
Abstract:
There is a notable dearth of results characterizing the preconditioning effect of Adam and showing how it may alleviate the curse of ill-conditioning -- an issue plaguing gradient descent (GD). In this work, we perform a detailed analysis of Adam's preconditioning effect for quadratic functions and quantify to what extent Adam can mitigate the dependence on the condition number of the Hessian. Our…
▽ More
There is a notable dearth of results characterizing the preconditioning effect of Adam and showing how it may alleviate the curse of ill-conditioning -- an issue plaguing gradient descent (GD). In this work, we perform a detailed analysis of Adam's preconditioning effect for quadratic functions and quantify to what extent Adam can mitigate the dependence on the condition number of the Hessian. Our key finding is that Adam can suffer less from the condition number but at the expense of suffering a dimension-dependent quantity. Specifically, for a $d$-dimensional quadratic with a diagonal Hessian having condition number $κ$, we show that the effective condition number-like quantity controlling the iteration complexity of Adam without momentum is $\mathcal{O}(\min(d, κ))$. For a diagonally dominant Hessian, we obtain a bound of $\mathcal{O}(\min(d \sqrt{d κ}, κ))$ for the corresponding quantity. Thus, when $d < \mathcal{O}(κ^p)$ where $p = 1$ for a diagonal Hessian and $p = 1/3$ for a diagonally dominant Hessian, Adam can outperform GD (which has an $\mathcal{O}(κ)$ dependence). On the negative side, our results suggest that Adam can be worse than GD for a sufficiently non-diagonal Hessian even if $d \ll \mathcal{O}(κ^{1/3})$; we corroborate this with empirical evidence. Finally, we extend our analysis to functions satisfying per-coordinate Lipschitz smoothness and a modified version of the Polyak-Łojasiewicz condition.
△ Less
Submitted 11 February, 2024;
originally announced February 2024.
-
A Computationally Efficient Sparsified Online Newton Method
Authors:
Fnu Devvrit,
Sai Surya Duvvuri,
Rohan Anil,
Vineet Gupta,
Cho-Jui Hsieh,
Inderjit Dhillon
Abstract:
Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order alg…
▽ More
Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
Accelerating Primal-dual Methods for Regularized Markov Decision Processes
Authors:
Haoya Li,
Hsiang-fu Yu,
Lexing Ying,
Inderjit Dhillon
Abstract:
Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The…
▽ More
Entropy regularized Markov decision processes have been widely used in reinforcement learning. This paper is concerned with the primal-dual formulation of the entropy regularized problems. Standard first-order methods suffer from slow convergence due to the lack of strict convexity and concavity. To address this issue, we first introduce a new quadratically convexified primal-dual formulation. The natural gradient ascent descent of the new formulation enjoys global convergence guarantee and exponential convergence rate. We also propose a new interpolating metric that further accelerates the convergence significantly. Numerical results are provided to demonstrate the performance of the proposed methods under multiple settings.
△ Less
Submitted 12 June, 2023; v1 submitted 21 February, 2022;
originally announced February 2022.
-
Approximate Newton policy gradient algorithms
Authors:
Haoya Li,
Samarth Gupta,
Hsiangfu Yu,
Lexing Ying,
Inderjit Dhillon
Abstract:
Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting…
▽ More
Policy gradient algorithms have been widely applied to Markov decision processes and reinforcement learning problems in recent years. Regularization with various entropy functions is often used to encourage exploration and improve stability. This paper proposes an approximate Newton method for the policy gradient algorithm with entropy regularization. In the case of Shannon entropy, the resulting algorithm reproduces the natural policy gradient algorithm. For other entropy functions, this method results in brand-new policy gradient algorithms. We prove that all these algorithms enjoy Newton-type quadratic convergence and that the corresponding gradient flow converges globally to the optimal solution. We use synthetic and industrial-scale examples to demonstrate that the proposed approximate Newton method typically converges in single-digit iterations, often orders of magnitude faster than other state-of-the-art algorithms.
△ Less
Submitted 8 June, 2023; v1 submitted 5 October, 2021;
originally announced October 2021.
-
Robust Training in High Dimensions via Block Coordinate Geometric Median Descent
Authors:
Anish Acharya,
Abolfazl Hashemi,
Prateek Jain,
Sujay Sanghavi,
Inderjit S. Dhillon,
Ufuk Topcu
Abstract:
Geometric median (\textsc{Gm}) is a classical method in statistics for achieving a robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) for high-dimensional optimization problems. In this paper, we show that by applying \textsc{G…
▽ More
Geometric median (\textsc{Gm}) is a classical method in statistics for achieving a robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) for high-dimensional optimization problems. In this paper, we show that by applying \textsc{Gm} to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems, with non-asymptotic convergence rates comparable to the SGD with \textsc{Gm}.
△ Less
Submitted 16 June, 2021;
originally announced June 2021.
-
On the Convergence of Differentially Private Federated Learning on Non-Lipschitz Objectives, and with Normalized Client Updates
Authors:
Rudrajit Das,
Abolfazl Hashemi,
Sujay Sanghavi,
Inderjit S. Dhillon
Abstract:
There is a dearth of convergence results for differentially private federated learning (FL) with non-Lipschitz objective functions (i.e., when gradient norms are not bounded). The primary reason for this is that the clip** operation (i.e., projection onto an $\ell_2$ ball of a fixed radius called the clip** threshold) for bounding the sensitivity of the average update to each client's update i…
▽ More
There is a dearth of convergence results for differentially private federated learning (FL) with non-Lipschitz objective functions (i.e., when gradient norms are not bounded). The primary reason for this is that the clip** operation (i.e., projection onto an $\ell_2$ ball of a fixed radius called the clip** threshold) for bounding the sensitivity of the average update to each client's update introduces bias depending on the clip** threshold and the number of local steps in FL, and analyzing this is not easy. For Lipschitz functions, the Lipschitz constant serves as a trivial clip** threshold with zero bias. However, Lipschitzness does not hold in many practical settings; moreover, verifying it and computing the Lipschitz constant is hard. Thus, the choice of the clip** threshold is non-trivial and requires a lot of tuning in practice. In this paper, we provide the first convergence result for private FL on smooth \textit{convex} objectives \textit{for a general clip** threshold} -- \textit{without assuming Lipschitzness}. We also look at a simpler alternative to clip** (for bounding sensitivity) which is \textit{normalization} -- where we use only a scaled version of the unit vector along the client updates, completely discarding the magnitude information. {The resulting normalization-based private FL algorithm is theoretically shown to have better convergence than its clip**-based counterpart on smooth convex functions. We corroborate our theory with synthetic experiments as well as experiments on benchmarking datasets.
△ Less
Submitted 15 April, 2022; v1 submitted 13 June, 2021;
originally announced June 2021.
-
Faster Non-Convex Federated Learning via Global and Local Momentum
Authors:
Rudrajit Das,
Anish Acharya,
Abolfazl Hashemi,
Sujay Sanghavi,
Inderjit S. Dhillon,
Ufuk Topcu
Abstract:
We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(ε^{-1.5})$ to converge to an $ε$-stationary point (i.e., $\mathbb{E}[\|\nabla f(\bm{x})\|^2] \leq ε$) for smooth non-convex functions -- under arbitrary client heterogeneity and compressed communication -- compared to the $\mathcal{O}(ε^{-2})$ complexity of most prior works. Our key…
▽ More
We propose \texttt{FedGLOMO}, a novel federated learning (FL) algorithm with an iteration complexity of $\mathcal{O}(ε^{-1.5})$ to converge to an $ε$-stationary point (i.e., $\mathbb{E}[\|\nabla f(\bm{x})\|^2] \leq ε$) for smooth non-convex functions -- under arbitrary client heterogeneity and compressed communication -- compared to the $\mathcal{O}(ε^{-2})$ complexity of most prior works. Our key algorithmic idea that enables achieving this improved complexity is based on the observation that the convergence in FL is hampered by two sources of high variance: (i) the global server aggregation step with multiple local updates, exacerbated by client heterogeneity, and (ii) the noise of the local client-level stochastic gradients. By modeling the server aggregation step as a generalized gradient-type update, we propose a variance-reducing momentum-based global update at the server, which when applied in conjunction with variance-reduced local updates at the clients, enables \texttt{FedGLOMO} to enjoy an improved convergence rate. Moreover, we derive our results under a novel and more realistic client-heterogeneity assumption which we verify empirically -- unlike prior assumptions that are hard to verify. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication efficiency.
△ Less
Submitted 24 October, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Optimization
Authors:
Abolfazl Hashemi,
Anish Acharya,
Rudrajit Das,
Haris Vikalo,
Sujay Sanghavi,
Inderjit Dhillon
Abstract:
In decentralized optimization, it is common algorithmic practice to have nodes interleave (local) gradient descent iterations with gossip (i.e. averaging over the network) steps. Motivated by the training of large-scale machine learning models, it is also increasingly common to require that messages be {\em lossy compressed} versions of the local parameters. In this paper, we show that, in such co…
▽ More
In decentralized optimization, it is common algorithmic practice to have nodes interleave (local) gradient descent iterations with gossip (i.e. averaging over the network) steps. Motivated by the training of large-scale machine learning models, it is also increasingly common to require that messages be {\em lossy compressed} versions of the local parameters. In this paper, we show that, in such compressed decentralized optimization settings, there are benefits to having {\em multiple} gossip steps between subsequent gradient iterations, even when the cost of doing so is appropriately accounted for e.g. by means of reducing the precision of compressed information. In particular, we show that having $O(\log\frac{1}ε)$ gradient iterations {with constant step size} - and $O(\log\frac{1}ε)$ gossip steps between every pair of these iterations - enables convergence to within $ε$ of the optimal value for smooth non-convex objectives satisfying Polyak-Łojasiewicz condition. This result also holds for smooth strongly convex objectives. To our knowledge, this is the first work that derives convergence results for nonconvex optimization under arbitrary communication compression.
△ Less
Submitted 20 November, 2020;
originally announced November 2020.
-
Primal-Dual Block Frank-Wolfe
Authors:
Qi Lei,
Jiacheng Zhuo,
Constantine Caramanis,
Inderjit S. Dhillon,
Alexandros G. Dimakis
Abstract:
We propose a variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. Our formulation includes Elastic Net, regularized SVMs and phase retrieval as special cases. The proposed Primal-Dual Block Frank-Wolfe algorithm reduces the per-iteration cost while maintaining linear convergence rate. The per iteration cost of our method depends on the structural compl…
▽ More
We propose a variant of the Frank-Wolfe algorithm for solving a class of sparse/low-rank optimization problems. Our formulation includes Elastic Net, regularized SVMs and phase retrieval as special cases. The proposed Primal-Dual Block Frank-Wolfe algorithm reduces the per-iteration cost while maintaining linear convergence rate. The per iteration cost of our method depends on the structural complexity of the solution (i.e. sparsity/low-rank) instead of the ambient dimension. We empirically show that our algorithm outperforms the state-of-the-art methods on (multi-class) classification tasks.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
Discrete Adversarial Attacks and Submodular Optimization with Applications to Text Classification
Authors:
Qi Lei,
Lingfei Wu,
Pin-Yu Chen,
Alexandros G. Dimakis,
Inderjit S. Dhillon,
Michael Witbrock
Abstract:
Adversarial examples are carefully constructed modifications to an input that completely change the output of a classifier but are imperceptible to humans. Despite these successful attacks for continuous data (such as image and audio samples), generating adversarial examples for discrete structures such as text has proven significantly more challenging. In this paper we formulate the attacks with…
▽ More
Adversarial examples are carefully constructed modifications to an input that completely change the output of a classifier but are imperceptible to humans. Despite these successful attacks for continuous data (such as image and audio samples), generating adversarial examples for discrete structures such as text has proven significantly more challenging. In this paper we formulate the attacks with discrete input on a set function as an optimization task. We prove that this set function is submodular for some popular neural network text classifiers under simplifying assumption. This finding guarantees a $1-1/e$ approximation factor for attacks that use the greedy algorithm. Meanwhile, we show how to use the gradient of the attacked classifier to guide the greedy search. Empirical studies with our proposed optimization scheme show significantly improved attack ability and efficiency, on three different text classification tasks over various baselines. We also use a joint sentence and word paraphrasing technique to maintain the original semantics and syntax of the text. This is validated by a human subject evaluation in subjective metrics on the quality and semantic coherence of our generated adversarial text.
△ Less
Submitted 4 April, 2019; v1 submitted 1 December, 2018;
originally announced December 2018.
-
Online Embedding Compression for Text Classification using Low Rank Matrix Factorization
Authors:
Anish Acharya,
Rahul Goel,
Angeliki Metallinou,
Inderjit Dhillon
Abstract:
Deep learning models have become state of the art for natural language processing (NLP) tasks, however deploying these models in production system poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low rank matrix factorization during training,to compress the word embedding layer which…
▽ More
Deep learning models have become state of the art for natural language processing (NLP) tasks, however deploying these models in production system poses significant memory constraints. Existing compression methods are either lossy or introduce significant latency. We propose a compression method that leverages low rank matrix factorization during training,to compress the word embedding layer which represents the size bottleneck for most NLP models. Our models are trained, compressed and then further re-trained on the downstream task to recover accuracy while maintaining the reduced size. Empirically, we show that the proposed method can achieve 90% compression with minimal impact in accuracy for sentence classification tasks, and outperforms alternative methods like fixed-point quantization or offline word embedding compression. We also analyze the inference time and storage space for our method through FLOP calculations, showing that we can compress DNN models by a configurable ratio and regain accuracy loss without introducing additional latency compared to fixed point quantization. Finally, we introduce a novel learning rate schedule, the Cyclically Annealed Learning Rate (CALR), which we empirically demonstrate to outperform other popular adaptive learning rate algorithms on a sentence classification benchmark.
△ Less
Submitted 1 November, 2018;
originally announced November 2018.
-
Coordinate Descent Methods for Symmetric Nonnegative Matrix Factorization
Authors:
Arnaud Vandaele,
Nicolas Gillis,
Qi Lei,
Kai Zhong,
Inderjit Dhillon
Abstract:
Given a symmetric nonnegative matrix $A$, symmetric nonnegative matrix factorization (symNMF) is the problem of finding a nonnegative matrix $H$, usually with much fewer columns than $A$, such that $A \approx HH^T$. SymNMF can be used for data analysis and in particular for various clustering tasks. In this paper, we propose simple and very efficient coordinate descent schemes to solve this proble…
▽ More
Given a symmetric nonnegative matrix $A$, symmetric nonnegative matrix factorization (symNMF) is the problem of finding a nonnegative matrix $H$, usually with much fewer columns than $A$, such that $A \approx HH^T$. SymNMF can be used for data analysis and in particular for various clustering tasks. In this paper, we propose simple and very efficient coordinate descent schemes to solve this problem, and that can handle large and sparse input matrices. The effectiveness of our methods is illustrated on synthetic and real-world data sets, and we show that they perform favorably compared to recent state-of-the-art methods.
△ Less
Submitted 31 May, 2016; v1 submitted 4 September, 2015;
originally announced September 2015.
-
PU Learning for Matrix Completion
Authors:
Cho-Jui Hsieh,
Nagarajan Natarajan,
Inderjit S. Dhillon
Abstract:
In this paper, we consider the matrix completion problem when the observations are one-bit measurements of some underlying matrix M, and in particular the observed samples consist only of ones and no zeros. This problem is motivated by modern applications such as recommender systems and social networks where only "likes" or "friendships" are observed. The problem of learning from only positive and…
▽ More
In this paper, we consider the matrix completion problem when the observations are one-bit measurements of some underlying matrix M, and in particular the observed samples consist only of ones and no zeros. This problem is motivated by modern applications such as recommender systems and social networks where only "likes" or "friendships" are observed. The problem of learning from only positive and unlabeled examples, called PU (positive-unlabeled) learning, has been studied in the context of binary classification. We consider the PU matrix completion problem, where an underlying real-valued matrix M is first quantized to generate one-bit observations and then a subset of positive entries is revealed. Under the assumption that M has bounded nuclear norm, we provide recovery guarantees for two different observation models: 1) M parameterizes a distribution that generates a binary matrix, 2) M is thresholded to obtain a binary matrix. For the first case, we propose a "shifted matrix completion" method that recovers M using only a subset of indices corresponding to ones, while for the second case, we propose a "biased matrix completion" method that recovers the (thresholded) binary matrix. Both methods yield strong error bounds --- if M is n by n, the Frobenius error is bounded as O(1/((1-rho)n), where 1-rho denotes the fraction of ones observed. This implies a sample complexity of O(n\log n) ones to achieve a small error, when M is dense and n is large. We extend our methods and guarantees to the inductive matrix completion problem, where rows and columns of M have associated features. We provide efficient and scalable optimization procedures for both the methods and demonstrate the effectiveness of the proposed methods for link prediction (on real-world networks consisting of over 2 million nodes and 90 million links) and semi-supervised clustering tasks.
△ Less
Submitted 21 November, 2014;
originally announced November 2014.
-
Constructing packings in Grassmannian manifolds via alternating projection
Authors:
I. S. Dhillon,
R. W. Heath Jr,
T. Strohmer,
J. A. Tropp
Abstract:
This paper describes a numerical method for finding good packings in Grassmannian manifolds equipped with various metrics. This investigation also encompasses packing in projective spaces. In each case, producing a good packing is equivalent to constructing a matrix that has certain structural and spectral properties. By alternately enforcing the structural condition and then the spectral condit…
▽ More
This paper describes a numerical method for finding good packings in Grassmannian manifolds equipped with various metrics. This investigation also encompasses packing in projective spaces. In each case, producing a good packing is equivalent to constructing a matrix that has certain structural and spectral properties. By alternately enforcing the structural condition and then the spectral condition, it is often possible to reach a matrix that satisfies both. One may then extract a packing from this matrix.
This approach is both powerful and versatile. In cases where experiments have been performed, the alternating projection method yields packings that compete with the best packings recorded. It also extends to problems that have not been studied numerically. For example, it can be used to produce packings of subspaces in real and complex Grassmannian spaces equipped with the Fubini--Study distance; these packings are valuable in wireless communications. One can prove that some of the novel configurations constructed by the algorithm have packing diameters that are nearly optimal.
△ Less
Submitted 4 September, 2007;
originally announced September 2007.