Search | arXiv e-print repository

Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD

Authors: Matthew Faw, Litu Rout, Constantine Caramanis, Sanjay Shakkottai

Abstract: This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely captures practical machine learning problems as compared to the pe… ▽ More This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR'20). Empirical evidence suggests that these functions more closely captures practical machine learning problems as compared to the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$ which is $(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $L_0$-smooth setting, and in many practical settings is either not satisfied, or results in weaker convergence rates with respect to the noise scaling of the convergence rate. We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stop** time $τ$ which is simultaneously "large" on average, yet also allows us to treat the adaptive step sizes before $τ$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $σ_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $σ_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability even for smooth and strongly-convex functions when $σ_1 > 1$. △ Less

Submitted 13 February, 2023; originally announced February 2023.

arXiv:2202.05791 [pdf, other]

The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance

Authors: Matthew Faw, Isidoros Tziotis, Constantine Caramanis, Aryan Mokhtari, Sanjay Shakkottai, Rachel Ward

Abstract: We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following a… ▽ More We study convergence rates of AdaGrad-Norm as an exemplar of adaptive stochastic gradient methods (SGD), where the step sizes change based on observed stochastic gradients, for minimizing non-convex, smooth objectives. Despite their popularity, the analysis of adaptive SGD lags behind that of non adaptive methods in this setting. Specifically, all prior works rely on some subset of the following assumptions: (i) uniformly-bounded gradient norms, (ii) uniformly-bounded stochastic gradient variance (or even noise support), (iii) conditional independence between the step size and stochastic gradient. In this work, we show that AdaGrad-Norm exhibits an order optimal convergence rate of $\mathcal{O}\left(\frac{\mathrm{poly}\log(T)}{\sqrt{T}}\right)$ after $T$ iterations under the same assumptions as optimally-tuned non adaptive SGD (unbounded gradient norms and affine noise variance scaling), and crucially, without needing any tuning parameters. We thus establish that adaptive gradient methods exhibit order-optimal convergence in much broader regimes than previously understood. △ Less

Submitted 25 July, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

Comments: Accepted to COLT 2022

arXiv:2111.03174 [pdf, ps, other]

doi 10.1137/1.9781611977073

Single-Sample Prophet Inequalities via Greedy-Ordered Selection

Authors: Constantine Caramanis, Paul Dütting, Matthew Faw, Federico Fusco, Philip Lazos, Stefano Leonardi, Orestis Papadigenopoulos, Emmanouil Pountourakis, Rebecca Reiffenhäuser

Abstract: We study single-sample prophet inequalities (SSPIs), i.e., prophet inequalities where only a single sample from each prior distribution is available. Besides a direct, and optimal, SSPI for the basic single choice problem [Rubinstein et al., 2020], most existing SSPI results were obtained via an elegant, but inherently lossy, reduction to order-oblivious secretary (OOS) policies [Azar et al., 2014… ▽ More We study single-sample prophet inequalities (SSPIs), i.e., prophet inequalities where only a single sample from each prior distribution is available. Besides a direct, and optimal, SSPI for the basic single choice problem [Rubinstein et al., 2020], most existing SSPI results were obtained via an elegant, but inherently lossy, reduction to order-oblivious secretary (OOS) policies [Azar et al., 2014]. Motivated by this discrepancy, we develop an intuitive and versatile greedy-based technique that yields SSPIs directly rather than through the reduction to OOSs. Our results can be seen as generalizing and unifying a number of existing results in the area of prophet and secretary problems. Our algorithms significantly improve on the competitive guarantees for a number of interesting scenarios (including general matching with edge arrivals, bipartite matching with vertex arrivals, and certain matroids), and capture new settings (such as budget additive combinatorial auctions). Complementing our algorithmic results, we also consider mechanism design variants. Finally, we analyze the power and limitations of different SSPI approaches by providing a partial converse to the reduction from SSPI to OOS given by Azar et al. △ Less

Submitted 15 March, 2024; v1 submitted 4 November, 2021; originally announced November 2021.

Comments: Merges and extends arXiv:2103.13089 [cs.GT] and arXiv:2104.02050 [cs.DS]

Journal ref: Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2022)

arXiv:2103.13089 [pdf, ps, other]

Single-Sample Prophet Inequalities Revisited

Authors: Constantine Caramanis, Matthew Faw, Orestis Papadigenopoulos, Emmanouil Pountourakis

Abstract: The study of the prophet inequality problem in the limited information regime was initiated by Azar et al. [SODA'14] in the pursuit of prior-independent posted-price mechanisms. As they show, $O(1)$-competitive policies are achievable using only a single sample from the distribution of each agent. A notable portion of their results relies on reducing the design of single-sample prophet inequalitie… ▽ More The study of the prophet inequality problem in the limited information regime was initiated by Azar et al. [SODA'14] in the pursuit of prior-independent posted-price mechanisms. As they show, $O(1)$-competitive policies are achievable using only a single sample from the distribution of each agent. A notable portion of their results relies on reducing the design of single-sample prophet inequalities (SSPIs) to that of order-oblivious secretary (OOS) policies. The above reduction comes at the cost of not fully utilizing the available samples. However, to date, this is essentially the only method for proving SSPIs for many combinatorial sets. Very recently, Rubinstein et al. [ITCS'20] give a surprisingly simple algorithm which achieves the optimal competitive ratio for the single-choice SSPI problem $-$ a result which is unobtainable going through the reduction to secretary problems. Motivated by this discrepancy, we study the competitiveness of simple SSPI policies directly, without appealing to results from OOS literature. In this direction, we first develop a framework for analyzing policies against a greedy-like prophet solution. Using this framework, we obtain the first SSPI for general (non-bipartite) matching environments, as well as improved competitive ratios for transversal and truncated partition matroids. Second, motivated by the observation that many OOS policies for matroids decompose the problem into independent rank-$1$ instances, we provide a meta-theorem which applies to any matroid satisfying this partition property. Leveraging the recent results by Rubinstein et al., we obtain improved competitive guarantees (most by a factor of $2$) for a number of matroids captured by the reduction of Azar et al. Finally, we discuss applications of our SSPIs to the design of mechanisms for multi-dimensional limited information settings with improved revenue and welfare guarantees. △ Less

Submitted 22 November, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: See arXiv:2111.03174 for the latest version of this paper (merged with arXiv:2104.02050)

arXiv:1907.10154 [pdf, other]

Mix and Match: An Optimistic Tree-Search Approach for Learning Models from Mixture Distributions

Authors: Matthew Faw, Rajat Sen, Karthikeyan Shanmugam, Constantine Caramanis, Sanjay Shakkottai

Abstract: We consider a covariate shift problem where one has access to several different training datasets for the same learning problem and a small validation set which possibly differs from all the individual training distributions. This covariate shift is caused, in part, due to unobserved features in the datasets. The objective, then, is to find the best mixture distribution over the training datasets… ▽ More We consider a covariate shift problem where one has access to several different training datasets for the same learning problem and a small validation set which possibly differs from all the individual training distributions. This covariate shift is caused, in part, due to unobserved features in the datasets. The objective, then, is to find the best mixture distribution over the training datasets (with only observed features) such that training a learning algorithm using this mixture has the best validation performance. Our proposed algorithm, ${\sf Mix\&Match}$, combines stochastic gradient descent (SGD) with optimistic tree search and model re-use (evolving partially trained models with samples from different mixture distributions) over the space of mixtures, for this task. We prove simple regret guarantees for our algorithm with respect to recovering the optimal mixture, given a total budget of SGD evaluations. Finally, we validate our algorithm on two real-world datasets. △ Less

Submitted 14 July, 2020; v1 submitted 23 July, 2019; originally announced July 2019.

Comments: New from previous version: Adds Acknowledgements section

Showing 1–5 of 5 results for author: Faw, M