Search | arXiv e-print repository

Empirical Tests of Optimization Assumptions in Deep Learning

Authors: Hoang Tran, Qinzi Zhang, Ashok Cutkosky

Abstract: There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. The theory/prac… ▽ More There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. The theory/practice gap may then arise because of the failure to prove a theorem under such assumptions, or because the assumptions do not reflect reality. In this paper, we carefully measure the degree to which these assumptions are capable of explaining modern optimization algorithms by develo** new empirical metrics that closely track the key quantities that must be controlled in theoretical analysis. All of our tested assumptions (including typical modern assumptions based on bounds on the Hessian) fail to reliably capture optimization performance. This highlights a need for new empirical verification of analytical assumptions used in theoretical analysis. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.19579 [pdf, ps, other]

Private Zeroth-Order Nonsmooth Nonconvex Optimization

Authors: Qinzi Zhang, Hoang Tran, Ashok Cutkosky

Abstract: We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives. Given a dataset of size $M$, our algorithm ensures $(α,αρ^2/2)$-Rényi differential privacy and finds a $(δ,ε)$-stationary point so long as $M=\tildeΩ\left(\frac{d}{δε^3} + \frac{d^{3/2}}{ρδε^2}\right)$. This matches the optimal complexity of its non-private zeroth-order analog. Nota… ▽ More We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives. Given a dataset of size $M$, our algorithm ensures $(α,αρ^2/2)$-Rényi differential privacy and finds a $(δ,ε)$-stationary point so long as $M=\tildeΩ\left(\frac{d}{δε^3} + \frac{d^{3/2}}{ρδε^2}\right)$. This matches the optimal complexity of its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy ``for free'' whenever $ρ\ge \sqrt{d}ε$. △ Less

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2405.20540 [pdf, ps, other]

Fully Unconstrained Online Learning

Authors: Ashok Cutkosky, Zakaria Mhammedi

Abstract: We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or… ▽ More We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably most "interesting" scenarios. △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.18199 [pdf, ps, other]

Adam with model exponential moving average is effective for nonconvex optimization

Authors: Kwangjun Ahn, Ashok Cutkosky

Abstract: In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth an… ▽ More In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide applications in practice. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Comments would be appreciated!

arXiv:2405.15682 [pdf, other]

The Road Less Scheduled

Authors: Aaron Defazio, Xingyu, Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

Abstract: Existing learning rate schedules that do not require specification of the optimization stop** step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stop** time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from c… ▽ More Existing learning rate schedules that do not require specification of the optimization stop** step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stop** time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (https://github.com/facebookresearch/schedule_free). △ Less

Submitted 30 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.09742 [pdf, other]

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Authors: Qinzi Zhang, Ashok Cutkosky

Abstract: Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at… ▽ More Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which classical analysis applies only if the loss is either convex or smooth. We show that a very small modification to SGDM closes this gap: simply scale the update at each time point by an exponentially distributed random scalar. The resulting algorithm achieves optimal convergence guarantees. Intriguingly, this result is not derived by a specific analysis of SGDM: instead, it falls naturally out of a more general framework for converting online convex optimization algorithms to non-convex optimization algorithms. △ Less

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2302.03775 [pdf, ps, other]

Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion

Authors: Ashok Cutkosky, Harsh Mehta, Francesco Orabona

Abstract: We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(δ,ε)$-stationary point from $O(ε^{-4}δ^{-1})$ stochastic gradient queries to $O(ε^{-3}δ^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to onl… ▽ More We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a $(δ,ε)$-stationary point from $O(ε^{-4}δ^{-1})$ stochastic gradient queries to $O(ε^{-3}δ^{-1})$, which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of $O(ε^{-1.5}δ^{-0.5})$. Our techniques also recover all optimal or best-known results for finding $ε$ stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings. △ Less

Submitted 11 February, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

arXiv:2301.13349 [pdf, other]

Unconstrained Dynamic Regret via Sparse Coding

Authors: Zhiyu Zhang, Ashok Cutkosky, Ioannis Ch. Paschalidis

Abstract: Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from min… ▽ More Motivated by the challenge of nonstationarity in sequential decision making, we study Online Convex Optimization (OCO) under the coupling of two problem structures: the domain is unbounded, and the comparator sequence $u_1,\ldots,u_T$ is arbitrarily time-varying. As no algorithm can guarantee low regret simultaneously against all comparator sequences, handling this setting requires moving from minimax optimality to comparator adaptivity. That is, sensible regret bounds should depend on certain complexity measures of the comparator relative to one's prior knowledge. This paper achieves a new type of these adaptive regret bounds via a sparse coding framework. The complexity of the comparator is measured by its energy and its sparsity on a user-specified dictionary, which offers considerable versatility. Equipped with a wavelet dictionary for example, our framework improves the state-of-the-art bound (Jacobsen & Cutkosky, 2022) by adapting to both ($i$) the magnitude of the comparator average $||\bar u||=||\sum_{t=1}^Tu_t/T||$, rather than the maximum $\max_t||u_t||$; and ($ii$) the comparator variability $\sum_{t=1}^T||u_t-\bar u||$, rather than the uncentered sum $\sum_{t=1}^T||u_t||$. Furthermore, our analysis is simpler due to decoupling function approximation from regret minimization. △ Less

Submitted 25 October, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

Comments: NeurIPS 2023

arXiv:2203.00444 [pdf, other]

Parameter-free Mirror Descent

Authors: Andrew Jacobsen, Ashok Cutkosky

Abstract: We develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve s… ▽ More We develop a modified online mirror descent framework that is suitable for building adaptive and parameter-free algorithms in unbounded domains. We leverage this technique to develop the first unconstrained online linear optimization algorithm achieving an optimal dynamic regret bound, and we further demonstrate that natural strategies based on Follow-the-Regularized-Leader are unable to achieve similar results. We also apply our mirror descent framework to build new parameter-free implicit updates, as well as a simplified and improved unconstrained scale-free algorithm. △ Less

Submitted 8 February, 2024; v1 submitted 26 February, 2022; originally announced March 2022.

Comments: 59 pages. v4: Added a new section (7. Trade-offs in the Horizon Dependence) discussing how to achieve an alternative type of parameter-free bound using our framework; v3: published at COLT 2022 + fixed typos; v2: improved the algorithms in sections 3, 5, and 6 (tighter regret, simpler updates and analysis), corrected minor technical details and fixed typos

arXiv:2202.00089 [pdf, other]

Understanding AdamW through Proximal Methods and Scale-Freeness

Authors: Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona

Abstract: Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$.… ▽ More Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as Adam-$\ell_2$). However, even better performance can be obtained with AdamW, which decouples the gradient of the regularizer from the update rule of Adam-$\ell_2$. Yet, we are still lacking a complete explanation of the advantages of AdamW. In this paper, we tackle this question from both an optimization and an empirical point of view. First, we show how to re-interpret AdamW as an approximation of a proximal gradient method, which takes advantage of the closed-form proximal map** of the regularizer instead of only utilizing its gradient information as in Adam-$\ell_2$. Next, we consider the property of "scale-freeness" enjoyed by AdamW and by its proximal counterpart: their updates are invariant to component-wise rescaling of the gradients. We provide empirical evidence across a wide range of deep learning experiments showing a correlation between the problems in which AdamW exhibits an advantage over Adam-$\ell_2$ and the degree to which we expect the gradients of the network to exhibit multiple scales, thus motivating the hypothesis that the advantage of AdamW could be due to the scale-free updates. △ Less

Submitted 31 January, 2022; originally announced February 2022.

arXiv:2106.14343 [pdf, other]

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

Authors: Ashok Cutkosky, Harsh Mehta

Abstract: We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clip**, momentum, and normalized gradient descent yields convergence to critical points in high-probability with best-known rates for smooth losses when the gradients only have bounded $\mathfrak{p}$th moments for some… ▽ More We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clip**, momentum, and normalized gradient descent yields convergence to critical points in high-probability with best-known rates for smooth losses when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which to our knowledge have not been studied in this setting, and again obtain high-probability bounds for any $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis which requires a Hilbert space norm. Further, we show that after a suitable "burn-in" period, the objective value will monotonically decrease for every iteration until a critical point is identified, which provides intuition behind the popular practice of learning rate "warm-up" and also yields a last-iterate guarantee. △ Less

Submitted 9 November, 2021; v1 submitted 27 June, 2021; originally announced June 2021.

arXiv:2002.04726 [pdf, ps, other]

Online Learning with Imperfect Hints

Authors: Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

Abstract: We consider a variant of the classical online linear optimization problem in which at every step, the online player receives a "hint" vector before choosing the action for that round. Rather surprisingly, it was shown that if the hint vector is guaranteed to have a positive correlation with the cost vector, then the online player can achieve a regret of $O(\log T)$, thus significantly improving ov… ▽ More We consider a variant of the classical online linear optimization problem in which at every step, the online player receives a "hint" vector before choosing the action for that round. Rather surprisingly, it was shown that if the hint vector is guaranteed to have a positive correlation with the cost vector, then the online player can achieve a regret of $O(\log T)$, thus significantly improving over the $O(\sqrt{T})$ regret in the general setting. However, the result and analysis require the correlation property at \emph{all} time steps, thus raising the natural question: can we design online learning algorithms that are resilient to bad hints? In this paper we develop algorithms and nearly matching lower bounds for online learning with imperfect directional hints. Our algorithms are oblivious to the quality of the hints, and the regret bounds interpolate between the always-correlated hints case and the no-hints case. Our results also generalize, simplify, and improve upon previous results on optimistic regret bounds, which can be viewed as an additive version of hints. △ Less

Submitted 2 October, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

Comments: appeared in ICML 2020

arXiv:2002.03305 [pdf, other]

Momentum Improves Normalized SGD

Authors: Ashok Cutkosky, Harsh Mehta

Abstract: We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $ε$-critical point in $O(1/ε^{3.5})$ iterations, matching the b… ▽ More We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $ε$-critical point in $O(1/ε^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks. △ Less

Submitted 16 May, 2020; v1 submitted 9 February, 2020; originally announced February 2020.

arXiv:1905.12721 [pdf, other]

Matrix-Free Preconditioning in Online Learning

Authors: Ashok Cutkosky, Tamas Sarlos

Abstract: We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained by diagonal preconditioning, and in certain setting even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorit… ▽ More We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained by diagonal preconditioning, and in certain setting even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorithm runs in the same time and space complexity as online gradient descent. Along the way we incorporate new techniques that mildly streamline and improve logarithmic factors in prior regret analyses. We conclude by benchmarking our algorithm on synthetic data and deep learning tasks. △ Less

Submitted 29 May, 2019; originally announced May 2019.

Comments: ICML 2019

arXiv:1905.10018 [pdf, other]

Momentum-Based Variance Reduction in Non-Convex SGD

Authors: Ashok Cutkosky, Francesco Orabona

Abstract: Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-bat… ▽ More Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the converge rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $\boldsymbol{x}$ with $\mathbb{E}[\|\nabla F(\boldsymbol{x})\|]\le O(1/\sqrt{T}+σ^{1/3}/T^{1/3})$ in $T$ iterations with $σ^2$ variance in the gradients, matching the optimal rate but without requiring knowledge of $σ$. △ Less

Submitted 21 April, 2020; v1 submitted 23 May, 2019; originally announced May 2019.

Comments: Added Ack

arXiv:1903.00974 [pdf, ps, other]

Anytime Online-to-Batch Conversions, Optimism, and Acceleration

Authors: Ashok Cutkosky

Abstract: A standard way to obtain convergence guarantees in stochastic convex optimization is to run an online learning algorithm and then output the average of its iterates: the actual iterates of the online learning algorithm do not come with individual guarantees. We close this gap by introducing a black-box modification to any online learning algorithm whose iterates converge to the optimum in stochast… ▽ More A standard way to obtain convergence guarantees in stochastic convex optimization is to run an online learning algorithm and then output the average of its iterates: the actual iterates of the online learning algorithm do not come with individual guarantees. We close this gap by introducing a black-box modification to any online learning algorithm whose iterates converge to the optimum in stochastic scenarios. We then consider the case of smooth losses, and show that combining our approach with optimistic online learning algorithms immediately yields a fast convergence rate of $O(L/T^{3/2}+σ/\sqrt{T})$ on $L$-smooth problems with $σ^2$ variance in the gradients. Finally, we provide a reduction that converts any adaptive online algorithm into one that obtains the optimal accelerated rate of $\tilde O(L/T^2 + σ/\sqrt{T})$, while still maintaining $\tilde O(1/\sqrt{T})$ convergence in the non-smooth setting. Importantly, our algorithms adapt to $L$ and $σ$ automatically: they do not need to know either to obtain these rates. △ Less

Submitted 3 March, 2019; originally announced March 2019.

arXiv:1902.09013 [pdf, ps, other]

Artificial Constraints and Lipschitz Hints for Unconstrained Online Learning

Authors: Ashok Cutkosky

Abstract: We provide algorithms that guarantee regret $R_T(u)\le \tilde O(G\|u\|^3 + G(\|u\|+1)\sqrt{T})$ or $R_T(u)\le \tilde O(G\|u\|^3T^{1/3} + GT^{1/3}+ G\|u\|\sqrt{T})$ for online convex optimization with $G$-Lipschitz losses for any comparison point $u$ without prior knowledge of either $G$ or $\|u\|$. Previous algorithms dispense with the $O(\|u\|^3)$ term at the expense of knowledge of one or both o… ▽ More We provide algorithms that guarantee regret $R_T(u)\le \tilde O(G\|u\|^3 + G(\|u\|+1)\sqrt{T})$ or $R_T(u)\le \tilde O(G\|u\|^3T^{1/3} + GT^{1/3}+ G\|u\|\sqrt{T})$ for online convex optimization with $G$-Lipschitz losses for any comparison point $u$ without prior knowledge of either $G$ or $\|u\|$. Previous algorithms dispense with the $O(\|u\|^3)$ term at the expense of knowledge of one or both of these parameters, while a lower bound shows that some additional penalty term over $G\|u\|\sqrt{T}$ is necessary. Previous penalties were exponential while our bounds are polynomial in all quantities. Further, given a known bound $\|u\|\le D$, our same techniques allow us to design algorithms that adapt optimally to the unknown value of $\|u\|$ without requiring knowledge of $G$. △ Less

Submitted 24 February, 2019; originally announced February 2019.

arXiv:1902.09003 [pdf, ps, other]

Combining Online Learning Guarantees

Authors: Ashok Cutkosky

Abstract: We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. Our method is embarrassingly simple: just add the iterates. This trick can generate efficient algorithms that adapt to many norms simultaneously, as well as providing diagonal-style algorithms that still maintai… ▽ More We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. Our method is embarrassingly simple: just add the iterates. This trick can generate efficient algorithms that adapt to many norms simultaneously, as well as providing diagonal-style algorithms that still maintain dimension-free guarantees. We then proceed to show how a variant on this idea yields a black-box procedure for generating optimistic online learning algorithms. This yields the first optimistic regret guarantees in the unconstrained setting and generically increases adaptivity. Further, our optimistic algorithms are guaranteed to do no worse than their non-optimistic counterparts regardless of the quality of the optimistic estimates provided to the algorithm. △ Less

Submitted 24 February, 2019; originally announced February 2019.

arXiv:1901.09068 [pdf, other]

Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization

Authors: Zhenxun Zhuang, Ashok Cutkosky, Francesco Orabona

Abstract: Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. They have proved efficient in reducing the labor of tuning in pract… ▽ More Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. They have proved efficient in reducing the labor of tuning in practice, but many of them lack theoretic guarantees even in the convex setting. In this paper, we propose new surrogate losses to cast the problem of learning the optimal stepsizes for the stochastic optimization of a non-convex smooth objective function onto an online convex optimization problem. This allows the use of no-regret online algorithms to compute optimal stepsizes on the fly. In turn, this results in a SGD algorithm with self-tuned stepsizes that guarantees convergence rates that are automatically adaptive to the level of noise. △ Less

Submitted 7 June, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

arXiv:1802.06293 [pdf, ps, other]

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces

Authors: Ashok Cutkosky, Francesco Orabona

Abstract: We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce o… ▽ More We introduce several new black-box reductions that significantly improve the design of adaptive and parameter-free online learning algorithms by simplifying analysis, improving regret guarantees, and sometimes even improving runtime. We reduce parameter-free online learning to online exp-concave optimization, we reduce optimization in a Banach space to one-dimensional optimization, and we reduce optimization over a constrained domain to unconstrained optimization. All of our reductions run as fast as online gradient descent. We use our new techniques to improve upon the previously best regret bounds for parameter-free learning, and do so for arbitrary norms. △ Less

Submitted 25 June, 2018; v1 submitted 17 February, 2018; originally announced February 2018.

Comments: Appears in Conference on Learning Theory 2018

arXiv:0901.1678 [pdf, ps, other]

Associated Primes of the Square of the Alexander Dual of Hypergraphs

Authors: Ashok Cutkosky

Abstract: The purpose of this paper is to provide methods for determining the associated primes of the square of the Alexander dual of the edge ideal for an m-hypergraph H. We prove a general method for detecting associated primes of the square of the Alexander dual of the edge ideal based on combinatorial conditions on the m-hypergraph. Also, we demonstrate a more efficient combinatorial criterion for de… ▽ More The purpose of this paper is to provide methods for determining the associated primes of the square of the Alexander dual of the edge ideal for an m-hypergraph H. We prove a general method for detecting associated primes of the square of the Alexander dual of the edge ideal based on combinatorial conditions on the m-hypergraph. Also, we demonstrate a more efficient combinatorial criterion for detecting the non-existence of non-minimal associated primes. In investigating 3-hypergraphs, we prove a surprising extension of the previously discovered results for 2-hypergraphs (simple graphs). For 2-hypergraphs, associated primes of the square of the Alexander dual of the edge ideal are either of height 2 or of odd height greater than 2. However, we prove that in the 3-hypergraph case, there is no such restriction - or indeed any restriction - on the heights of the associated primes. Further, we generalize this result to any dimension greater than 3. Specifically, given any integers m, q, and n with 3\leq m\leq q\leq n, we construct a m-hypergraph of size n with an associated prime of height q. We further prove that it is possible to construct connected m-hypergraphs under the same conditions. △ Less

Submitted 12 January, 2009; originally announced January 2009.

Comments: 12 pages

MSC Class: 13F55; 05E99

Showing 1–21 of 21 results for author: Cutkosky, A