-
Lexicographic Optimization: Algorithms and Stability
Authors:
Jacob Abernethy,
Robert E. Schapire,
Umar Syed
Abstract:
A lexicographic maximum of a set $X \subseteq \mathbb{R}^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the im…
▽ More
A lexicographic maximum of a set $X \subseteq \mathbb{R}^n$ is a vector in $X$ whose smallest component is as large as possible, and subject to that requirement, whose second smallest component is as large as possible, and so on for the third smallest component, etc. Lexicographic maximization has numerous practical and theoretical applications, including fair resource allocation, analyzing the implicit regularization of learning algorithms, and characterizing refinements of game-theoretic equilibria. We prove that a minimizer in $X$ of the exponential loss function $L_c(\mathbf{x}) = \sum_i \exp(-c x_i)$ converges to a lexicographic maximum of $X$ as $c \rightarrow \infty$, provided that $X$ is stable in the sense that a well-known iterative method for finding a lexicographic maximum of $X$ cannot be made to fail simply by reducing the required quality of each iterate by an arbitrarily tiny degree. Our result holds for both near and exact minimizers of the exponential loss, while earlier convergence results made much stronger assumptions about the set $X$ and only held for the exact minimizer. We are aware of no previous results showing a connection between the iterative method for computing a lexicographic maximum and exponential loss minimization. We show that every convex polytope is stable, but that there exist compact, convex sets that are not stable. We also provide the first analysis of the convergence rate of an exponential loss minimizer (near or exact) and discover a curious dichotomy: While the two smallest components of the vector converge to the lexicographically maximum values very quickly (at roughly the rate $\frac{\log n}{c}$), all other components can converge arbitrarily slowly.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Extragradient Type Methods for Riemannian Variational Inequality Problems
Authors:
Zihao Hu,
Guanghui Wang,
Xi Wang,
Andre Wibisono,
Jacob Abernethy,
Molei Tao
Abstract:
Riemannian convex optimization and minimax optimization have recently drawn considerable attention. Their appeal lies in their capacity to adeptly manage the non-convexity of the objective function as well as constraints inherent in the feasible set in the Euclidean sense. In this work, we delve into monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian conve…
▽ More
Riemannian convex optimization and minimax optimization have recently drawn considerable attention. Their appeal lies in their capacity to adeptly manage the non-convexity of the objective function as well as constraints inherent in the feasible set in the Euclidean sense. In this work, we delve into monotone Riemannian Variational Inequality Problems (RVIPs), which encompass both Riemannian convex optimization and minimax optimization as particular cases. In the context of Euclidean space, it is established that the last-iterates of both the extragradient (EG) and past extragradient (PEG) methods converge to the solution of monotone variational inequality problems at a rate of $O\left(\frac{1}{\sqrt{T}}\right)$ (Cai et al., 2022). However, analogous behavior on Riemannian manifolds remains an open question. To bridge this gap, we introduce the Riemannian extragradient (REG) and Riemannian past extragradient (RPEG) methods. We demonstrate that both exhibit $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence. Additionally, we show that the average-iterate convergence of both REG and RPEG is $O\left(\frac{1}{T}\right)$, aligning with observations in the Euclidean case (Mokhtari et al., 2020). These results are enabled by judiciously addressing the holonomy effect so that additional complications in Riemannian cases can be reduced and the Euclidean proof inspired by the performance estimation problem (PEP) technique or the sum-of-squares (SOS) technique can be applied again.
△ Less
Submitted 1 June, 2024; v1 submitted 25 September, 2023;
originally announced September 2023.
-
On the Robustness of Epoch-Greedy in Multi-Agent Contextual Bandit Mechanisms
Authors:
Yinglun Xu,
Bhuvesh Kumar,
Jacob Abernethy
Abstract:
Efficient learning in multi-armed bandit mechanisms such as pay-per-click (PPC) auctions typically involves three challenges: 1) inducing truthful bidding behavior (incentives), 2) using personalization in the users (context), and 3) circumventing manipulations in click patterns (corruptions). Each of these challenges has been studied orthogonally in the literature; incentives have been addressed…
▽ More
Efficient learning in multi-armed bandit mechanisms such as pay-per-click (PPC) auctions typically involves three challenges: 1) inducing truthful bidding behavior (incentives), 2) using personalization in the users (context), and 3) circumventing manipulations in click patterns (corruptions). Each of these challenges has been studied orthogonally in the literature; incentives have been addressed by a line of work on truthful multi-armed bandit mechanisms, context has been extensively tackled by contextual bandit algorithms, while corruptions have been discussed via a recent line of work on bandits with adversarial corruptions. Since these challenges co-exist, it is important to understand the robustness of each of these approaches in addressing the other challenges, provide algorithms that can handle all simultaneously, and highlight inherent limitations in this combination. In this work, we show that the most prominent contextual bandit algorithm, $ε$-greedy can be extended to handle the challenges introduced by strategic arms in the contextual multi-arm bandit mechanism setting. We further show that $ε$-greedy is inherently robust to adversarial data corruption attacks and achieves performance that degrades linearly with the amount of corruption.
△ Less
Submitted 14 July, 2023;
originally announced July 2023.
-
Randomized Quantization is All You Need for Differential Privacy in Federated Learning
Authors:
Yeojoon Youn,
Zihao Hu,
Juba Ziani,
Jacob Abernethy
Abstract:
Federated learning (FL) is a common and practical framework for learning a machine model in a decentralized fashion. A primary motivation behind this decentralized approach is data privacy, ensuring that the learner never sees the data of each local source itself. Federated learning then comes with two majors challenges: one is handling potentially complex model updates between a server and a larg…
▽ More
Federated learning (FL) is a common and practical framework for learning a machine model in a decentralized fashion. A primary motivation behind this decentralized approach is data privacy, ensuring that the learner never sees the data of each local source itself. Federated learning then comes with two majors challenges: one is handling potentially complex model updates between a server and a large number of data sources; the other is that de-centralization may, in fact, be insufficient for privacy, as the local updates themselves can reveal information about the sources' data. To address these issues, we consider an approach to federated learning that combines quantization and differential privacy. Absent privacy, Federated Learning often relies on quantization to reduce communication complexity. We build upon this approach and develop a new algorithm called the \textbf{R}andomized \textbf{Q}uantization \textbf{M}echanism (RQM), which obtains privacy through a two-levels of randomization. More precisely, we randomly sub-sample feasible quantization levels, then employ a randomized rounding procedure using these sub-sampled discrete levels. We are able to establish that our results preserve ``Renyi differential privacy'' (Renyi DP). We empirically study the performance of our algorithm and demonstrate that compared to previous work it yields improved privacy-accuracy trade-offs for DP federated learning. To the best of our knowledge, this is the first study that solely relies on randomized quantization without incorporating explicit discrete noise to achieve Renyi DP guarantees in Federated Learning systems.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
Riemannian Projection-free Online Learning
Authors:
Zihao Hu,
Guanghui Wang,
Jacob Abernethy
Abstract:
The projection operation is a critical component in a wide range of optimization algorithms, such as online gradient descent (OGD), for enforcing constraints and achieving optimal regret bounds. However, it suffers from computational complexity limitations in high-dimensional settings or when dealing with ill-conditioned constraint sets. Projection-free algorithms address this issue by replacing t…
▽ More
The projection operation is a critical component in a wide range of optimization algorithms, such as online gradient descent (OGD), for enforcing constraints and achieving optimal regret bounds. However, it suffers from computational complexity limitations in high-dimensional settings or when dealing with ill-conditioned constraint sets. Projection-free algorithms address this issue by replacing the projection oracle with more efficient optimization subroutines. But to date, these methods have been developed primarily in the Euclidean setting, and while there has been growing interest in optimization on Riemannian manifolds, there has been essentially no work in trying to utilize projection-free tools here. An apparent issue is that non-trivial affine functions are generally non-convex in such domains. In this paper, we present methods for obtaining sub-linear regret guarantees in online geodesically convex optimization on curved spaces for two scenarios: when we have access to (a) a separation oracle or (b) a linear optimization oracle. For geodesically convex losses, and when a separation oracle is available, our algorithms achieve $O(T^{1/2}\:)$ and $O(T^{3/4}\;)$ adaptive regret guarantees in the full information setting and the bandit setting, respectively. When a linear optimization oracle is available, we obtain regret rates of $O(T^{3/4}\;)$ for geodesically convex losses and $O(T^{2/3}\; log T )$ for strongly geodesically convex losses.
△ Less
Submitted 1 June, 2024; v1 submitted 30 May, 2023;
originally announced May 2023.
-
Faster Margin Maximization Rates for Generic and Adversarially Robust Optimization Methods
Authors:
Guanghui Wang,
Zihao Hu,
Claudio Gentile,
Vidya Muthukumar,
Jacob Abernethy
Abstract:
First-order optimization methods tend to inherently favor certain solutions over others when minimizing an underdetermined training objective that has multiple global optima. This phenomenon, known as implicit bias, plays a critical role in understanding the generalization capabilities of optimization algorithms. Recent research has revealed that in separable binary classification tasks gradient-d…
▽ More
First-order optimization methods tend to inherently favor certain solutions over others when minimizing an underdetermined training objective that has multiple global optima. This phenomenon, known as implicit bias, plays a critical role in understanding the generalization capabilities of optimization algorithms. Recent research has revealed that in separable binary classification tasks gradient-descent-based methods exhibit an implicit bias for the $\ell_2$-maximal margin classifier. Similarly, generic optimization methods, such as mirror descent and steepest descent, have been shown to converge to maximal margin classifiers defined by alternative geometries. While gradient-descent-based algorithms provably achieve fast implicit bias rates, corresponding rates in the literature for generic optimization methods are relatively slow. To address this limitation, we present a series of state-of-the-art implicit bias rates for mirror descent and steepest descent algorithms. Our primary technique involves transforming a generic optimization algorithm into an online optimization dynamic that solves a regularized bilinear game, providing a unified framework for analyzing the implicit bias of various optimization methods. Our accelerated rates are derived by leveraging the regret bounds of online learning algorithms within this game framework. We then show the flexibility of this framework by analyzing the implicit bias in adversarial training, and again obtain significantly improved convergence rates.
△ Less
Submitted 7 April, 2024; v1 submitted 27 May, 2023;
originally announced May 2023.
-
A Mechanism for Sample-Efficient In-Context Learning for Sparse Retrieval Tasks
Authors:
Jacob Abernethy,
Alekh Agarwal,
Teodor V. Marinov,
Manfred K. Warmuth
Abstract:
We study the phenomenon of \textit{in-context learning} (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a…
▽ More
We study the phenomenon of \textit{in-context learning} (ICL) exhibited by large language models, where they can adapt to a new learning task, given a handful of labeled examples, without any explicit parameter optimization. Our goal is to explain how a pre-trained transformer model is able to perform ICL under reasonable assumptions on the pre-training process and the downstream tasks. We posit a mechanism whereby a transformer can achieve the following: (a) receive an i.i.d. sequence of examples which have been converted into a prompt using potentially-ambiguous delimiters, (b) correctly segment the prompt into examples and labels, (c) infer from the data a \textit{sparse linear regressor} hypothesis, and finally (d) apply this hypothesis on the given test example and return a predicted label. We establish that this entire procedure is implementable using the transformer mechanism, and we give sample complexity guarantees for this learning framework. Our empirical findings validate the challenge of segmentation, and we show a correspondence between our posited mechanisms and observed attention maps for step (c).
△ Less
Submitted 26 May, 2023;
originally announced May 2023.
-
Minimizing Dynamic Regret on Geodesic Metric Spaces
Authors:
Zihao Hu,
Guanghui Wang,
Jacob Abernethy
Abstract:
In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the bod…
▽ More
In this paper, we consider the sequential decision problem where the goal is to minimize the general dynamic regret on a complete Riemannian manifold. The task of offline optimization on such a domain, also known as a geodesic metric space, has recently received significant attention. The online setting has received significantly less attention, and it has remained an open question whether the body of results that hold in the Euclidean setting can be transplanted into the land of Riemannian manifolds where new challenges (e.g., curvature) come into play. In this paper, we show how to get optimistic regret bound on manifolds with non-positive curvature whenever improper learning is allowed and propose an array of adaptive no-regret algorithms. To the best of our knowledge, this is the first work that considers general dynamic regret and develops "optimistic" online learning algorithms which can be employed on geodesic metric spaces.
△ Less
Submitted 5 July, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Adaptive Oracle-Efficient Online Learning
Authors:
Guanghui Wang,
Zihao Hu,
Vidya Muthukumar,
Jacob Abernethy
Abstract:
The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponen…
▽ More
The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponentially-large (or even infinite) space of decisions and select that which performed the best on any dataset. But despite the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while performing well in worst-case settings, they do not adapt well to friendly environments. In this paper we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to the small-loss environment, under a particular condition which we call approximability (which is spiritually related to sufficient conditions provided by Dudík et al., [2020]). We identify a series of real-world settings, including online auctions and transductive online classification, for which approximability holds. We also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the oracle-efficient setting.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
On Accelerated Perceptrons and Beyond
Authors:
Guanghui Wang,
Rafael Hanashiro,
Etash Guha,
Jacob Abernethy
Abstract:
The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $γ> 0$. A foundational result is that Perceptron converges after $Ω(1/γ^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to…
▽ More
The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $γ> 0$. A foundational result is that Perceptron converges after $Ω(1/γ^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to $Ω(\sqrt{\log n}/γ)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also lead to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) For the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) We provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) For the classical $p$-norm Perceptron problem, we provide an algorithm with $Ω(\sqrt{(p-1)\log n}/γ)$ convergence rate, while existing algorithms suffer the $Ω({(p-1)}/γ^2)$ convergence rate.
△ Less
Submitted 17 October, 2022;
originally announced October 2022.
-
No-Regret Dynamics in the Fenchel Game: A Unified Framework for Algorithmic Convex Optimization
Authors:
Jun-Kun Wang,
Jacob Abernethy,
Kfir Y. Levy
Abstract:
We develop an algorithmic framework for solving convex optimization problems using no-regret game dynamics. By converting the problem of minimizing a convex function into an auxiliary problem of solving a min-max game in a sequential fashion, we can consider a range of strategies for each of the two-players who must select their actions one after the other. A common choice for these strategies are…
▽ More
We develop an algorithmic framework for solving convex optimization problems using no-regret game dynamics. By converting the problem of minimizing a convex function into an auxiliary problem of solving a min-max game in a sequential fashion, we can consider a range of strategies for each of the two-players who must select their actions one after the other. A common choice for these strategies are so-called no-regret learning algorithms, and we describe a number of such and prove bounds on their regret. We then show that many classical first-order methods for convex optimization -- including average-iterate gradient descent, the Frank-Wolfe algorithm, Nesterov's acceleration methods, and the accelerated proximal method -- can be interpreted as special cases of our framework as long as each player makes the correct choice of no-regret strategy. Proving convergence rates in this framework becomes very straightforward, as they follow from plugging in the appropriate known regret bounds. Our framework also gives rise to a number of new first-order methods for special cases of convex optimization that were not previously known.
△ Less
Submitted 18 February, 2023; v1 submitted 22 November, 2021;
originally announced November 2021.
-
Esca** Saddle Points Faster with Stochastic Momentum
Authors:
Jun-Kun Wang,
Chi-Heng Lin,
Jacob Abernethy
Abstract:
Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in…
▽ More
Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the gradient at the current iterate on a batch of examples, where the addition of a ``momentum'' term biases the update in the direction of the previous change in parameters. In non-stochastic convex optimization one can show that a momentum adjustment provably reduces convergence time in many settings, yet such results have been elusive in the stochastic and non-convex settings. At the same time, a widely-observed empirical phenomenon is that in training deep networks stochastic momentum appears to significantly improve convergence time, variants of it have flourished in the development of other popular update methods, e.g. ADAM [KB15], AMSGrad [RKK18], etc. Yet theoretical justification for the use of stochastic momentum has remained a significant open question. In this paper we propose an answer: stochastic momentum improves deep network training because it modifies SGD to escape saddle points faster and, consequently, to more quickly find a second order stationary point. Our theoretical results also shed light on the related question of how to choose the ideal momentum parameter--our analysis suggests that $β\in [0,1)$ should be large (close to 1), which comports with empirical findings. We also provide experimental findings that further validate these conclusions.
△ Less
Submitted 5 June, 2021;
originally announced June 2021.
-
A Multiclass Boosting Framework for Achieving Fast and Provable Adversarial Robustness
Authors:
Jacob Abernethy,
Pranjal Awasthi,
Satyen Kale
Abstract:
Alongside the well-publicized accomplishments of deep neural networks there has emerged an apparent bug in their success on tasks such as object recognition: with deep models trained using vanilla methods, input images can be slightly corrupted in order to modify output predictions, even when these corruptions are practically invisible. This apparent lack of robustness has led researchers to propo…
▽ More
Alongside the well-publicized accomplishments of deep neural networks there has emerged an apparent bug in their success on tasks such as object recognition: with deep models trained using vanilla methods, input images can be slightly corrupted in order to modify output predictions, even when these corruptions are practically invisible. This apparent lack of robustness has led researchers to propose methods that can help to prevent an adversary from having such capabilities. The state-of-the-art approaches have incorporated the robustness requirement into the loss function, and the training process involves taking stochastic gradient descent steps not using original inputs but on adversarially-corrupted ones. In this paper we propose a multiclass boosting framework to ensure adversarial robustness. Boosting algorithms are generally well-suited for adversarial scenarios, as they were classically designed to satisfy a minimax guarantee. We provide a theoretical foundation for this methodology and describe conditions under which robustness can be achieved given a weak training oracle. We show empirically that adversarially-robust multiclass boosting not only outperforms the state-of-the-art methods, it does so at a fraction of the training time.
△ Less
Submitted 3 March, 2021; v1 submitted 1 March, 2021;
originally announced March 2021.
-
Linear Separation via Optimism
Authors:
Rafael Hanashiro,
Jacob Abernethy
Abstract:
Binary linear classification has been explored since the very early days of the machine learning literature. Perhaps the most classical algorithm is the Perceptron, where a weight vector used to classify examples is maintained, and additive updates are made as incorrect examples are discovered. The Perceptron has been thoroughly studied and several versions have been proposed over many decades. Th…
▽ More
Binary linear classification has been explored since the very early days of the machine learning literature. Perhaps the most classical algorithm is the Perceptron, where a weight vector used to classify examples is maintained, and additive updates are made as incorrect examples are discovered. The Perceptron has been thoroughly studied and several versions have been proposed over many decades. The key theoretical fact about the Perceptron is that, so long as a perfect linear classifier exists with some margin $γ> 0$, the number of required updates to find such a perfect linear separator is bounded by $\frac{1}{γ^2}$. What has never been fully addressed is: does there exist an algorithm that can achieve this with fewer updates? In this paper we answer this in the affirmative: we propose the Optimistic Perceptron algorithm, a simple procedure that finds a separating hyperplane in no more than $\frac{1}γ$ updates. We also show experimentally that this procedure can significantly outperform Perceptron.
△ Less
Submitted 17 November, 2020;
originally announced November 2020.
-
Understanding How Over-Parametrization Leads to Acceleration: A case of learning a single teacher neuron
Authors:
Jun-Kun Wang,
Jacob Abernethy
Abstract:
Over-parametrization has become a popular technique in deep learning. It is observed that by over-parametrization, a larger neural network needs a fewer training iterations than a smaller one to achieve a certain level of performance -- namely, over-parametrization leads to acceleration in optimization. However, despite that over-parametrization is widely used nowadays, little theory is available…
▽ More
Over-parametrization has become a popular technique in deep learning. It is observed that by over-parametrization, a larger neural network needs a fewer training iterations than a smaller one to achieve a certain level of performance -- namely, over-parametrization leads to acceleration in optimization. However, despite that over-parametrization is widely used nowadays, little theory is available to explain the acceleration due to over-parametrization. In this paper, we propose understanding it by studying a simple problem first. Specifically, we consider the setting that there is a single teacher neuron with quadratic activation, where over-parametrization is realized by having multiple student neurons learn the data generated from the teacher neuron. We provably show that over-parametrization helps the iterate generated by gradient descent to enter the neighborhood of a global optimal solution that achieves zero testing error faster. On the other hand, we also point out an issue regarding the necessity of over-parametrization and study how the scaling of the output neurons affects the convergence time.
△ Less
Submitted 27 September, 2021; v1 submitted 4 October, 2020;
originally announced October 2020.
-
A Modular Analysis of Provable Acceleration via Polyak's Momentum: Training a Wide ReLU Network and a Deep Linear Network
Authors:
Jun-Kun Wang,
Chi-Heng Lin,
Jacob Abernethy
Abstract:
Incorporating a so-called "momentum" dynamic in gradient descent methods is widely used in neural net training as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence. At the same time, there are very few theoretical guarantees in the literature to explain this apparent acceleration effect. Even for the classical strongly convex quadratic prob…
▽ More
Incorporating a so-called "momentum" dynamic in gradient descent methods is widely used in neural net training as it has been broadly observed that, at least empirically, it often leads to significantly faster convergence. At the same time, there are very few theoretical guarantees in the literature to explain this apparent acceleration effect. Even for the classical strongly convex quadratic problems, several existing results only show Polyak's momentum has an accelerated linear rate asymptotically. In this paper, we first revisit the quadratic problems and show a non-asymptotic accelerated linear rate of Polyak's momentum. Then, we provably show that Polyak's momentum achieves acceleration for training a one-layer wide ReLU network and a deep linear network, which are perhaps the two most popular canonical models for studying optimization and deep learning in the literature. Prior work Du at al. 2019 and Wu et al. 2019 showed that using vanilla gradient descent, and with an use of over-parameterization, the error decays as $(1- Θ(\frac{1}{ κ'}))^t$ after $t$ iterations, where $κ'$ is the condition number of a Gram Matrix. Our result shows that with the appropriate choice of parameters Polyak's momentum has a rate of $(1-Θ(\frac{1}{\sqrt{κ'}}))^t$. For the deep linear network, prior work Hu et al. 2020 showed that vanilla gradient descent has a rate of $(1-Θ(\frac{1}κ))^t$, where $κ$ is the condition number of a data matrix. Our result shows an acceleration rate $(1- Θ(\frac{1}{\sqrtκ}))^t$ is achievable by Polyak's momentum. All the results in this work are obtained from a modular analysis, which can be of independent interest. This work establishes that momentum does indeed speed up neural net training.
△ Less
Submitted 10 June, 2021; v1 submitted 4 October, 2020;
originally announced October 2020.
-
Quickly Finding a Benign Region via Heavy Ball Momentum in Non-Convex Optimization
Authors:
Jun-Kun Wang,
Jacob Abernethy
Abstract:
The Heavy Ball Method, proposed by Polyak over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of…
▽ More
The Heavy Ball Method, proposed by Polyak over five decades ago, is a first-order method for optimizing continuous functions. While its stochastic counterpart has proven extremely popular in training deep networks, there are almost no known functions where deterministic Heavy Ball is provably faster than the simple and classical gradient descent algorithm in non-convex optimization. The success of Heavy Ball has thus far eluded theoretical understanding. Our goal is to address this gap, and in the present work we identify two non-convex problems where we provably show that the Heavy Ball momentum helps the iterate to enter a benign region that contains a global optimal point faster. We show that Heavy Ball exhibits simple dynamics that clearly reveal the benefit of using a larger value of momentum parameter for the problems. The first of these optimization problems is the phase retrieval problem, which has useful applications in physical science. The second of these optimization problems is the cubic-regularized minimization, a critical subroutine required by Nesterov-Polyak cubic-regularized method to find second-order stationary points in general smooth non-convex problems.
△ Less
Submitted 14 February, 2021; v1 submitted 3 October, 2020;
originally announced October 2020.
-
Online Kernel based Generative Adversarial Networks
Authors:
Yeojoon Youn,
Neil Thistlethwaite,
Sang Keun Choe,
Jacob Abernethy
Abstract:
One of the major breakthroughs in deep learning over the past five years has been the Generative Adversarial Network (GAN), a neural network-based generative model which aims to mimic some underlying distribution given a dataset of samples. In contrast to many supervised problems, where one tries to minimize a simple objective function of the parameters, GAN training is formulated as a min-max pro…
▽ More
One of the major breakthroughs in deep learning over the past five years has been the Generative Adversarial Network (GAN), a neural network-based generative model which aims to mimic some underlying distribution given a dataset of samples. In contrast to many supervised problems, where one tries to minimize a simple objective function of the parameters, GAN training is formulated as a min-max problem over a pair of network parameters. While empirically GANs have shown impressive success in several domains, researchers have been puzzled by unusual training behavior, including cycling so-called mode collapse. In this paper, we begin by providing a quantitative method to explore some of the challenges in GAN training, and we show empirically how this relates fundamentally to the parametric nature of the discriminator network. We propose a novel approach that resolves many of these issues by relying on a kernel-based non-parametric discriminator that is highly amenable to online training---we call this the Online Kernel-based Generative Adversarial Networks (OKGAN). We show empirically that OKGANs mitigate a number of training issues, including mode collapse and cycling, and are much more amenable to theoretical guarantees. OKGANs empirically perform dramatically better, with respect to reverse KL-divergence, than other GAN formulations on synthetic data; on classical vision datasets such as MNIST, SVHN, and CelebA, show comparable performance.
△ Less
Submitted 19 June, 2020;
originally announced June 2020.
-
Active Sampling for Min-Max Fairness
Authors:
Jacob Abernethy,
Pranjal Awasthi,
Matthäus Kleindessner,
Jamie Morgenstern,
Chris Russell,
Jie Zhang
Abstract:
We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our rob…
▽ More
We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
△ Less
Submitted 17 June, 2022; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Fast Convergence of Fictitious Play for Diagonal Payoff Matrices
Authors:
Jacob Abernethy,
Kevin A. Lai,
Andre Wibisono
Abstract:
Fictitious Play (FP) is a simple and natural dynamic for repeated play in zero-sum games. Proposed by Brown in 1949, FP was shown to converge to a Nash Equilibrium by Robinson in 1951, albeit at a slow rate that may depend on the dimension of the problem. In 1959, Karlin conjectured that FP converges at the more natural rate of $O(1/\sqrt{t})$. However, Daskalakis and Pan disproved a version of th…
▽ More
Fictitious Play (FP) is a simple and natural dynamic for repeated play in zero-sum games. Proposed by Brown in 1949, FP was shown to converge to a Nash Equilibrium by Robinson in 1951, albeit at a slow rate that may depend on the dimension of the problem. In 1959, Karlin conjectured that FP converges at the more natural rate of $O(1/\sqrt{t})$. However, Daskalakis and Pan disproved a version of this conjecture in 2014, showing that a slow rate can occur, although their result relies on adversarial tie-breaking. In this paper, we show that Karlin's conjecture is indeed correct for the class of diagonal payoff matrices, as long as ties are broken lexicographically. Specifically, we show that FP converges at a $O(1/\sqrt{t})$ rate in the case when the payoff matrix is diagonal. We also prove this bound is tight by showing a matching lower bound in the identity payoff case under the lexicographic tie-breaking assumption.
△ Less
Submitted 15 November, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
Competing Against Equilibria in Zero-Sum Games with Evolving Payoffs
Authors:
Adrian Rivera Cardoso,
Jacob Abernethy,
He Wang,
Huan Xu
Abstract:
We study the problem of repeated play in a zero-sum game in which the payoff matrix may change, in a possibly adversarial fashion, on each round; we call these Online Matrix Games. Finding the Nash Equilibrium (NE) of a two player zero-sum game is core to many problems in statistics, optimization, and economics, and for a fixed game matrix this can be easily reduced to solving a linear program. Bu…
▽ More
We study the problem of repeated play in a zero-sum game in which the payoff matrix may change, in a possibly adversarial fashion, on each round; we call these Online Matrix Games. Finding the Nash Equilibrium (NE) of a two player zero-sum game is core to many problems in statistics, optimization, and economics, and for a fixed game matrix this can be easily reduced to solving a linear program. But when the payoff matrix evolves over time our goal is to find a sequential algorithm that can compete with, in a certain sense, the NE of the long-term-averaged payoff matrix. We design an algorithm with small NE regret--that is, we ensure that the long-term payoff of both players is close to minimax optimum in hindsight. Our algorithm achieves near-optimal dependence with respect to the number of rounds and depends poly-logarithmically on the number of available actions of the players. Additionally, we show that the naive reduction, where each player simply minimizes its own regret, fails to achieve the stated objective regardless of which algorithm is used. We also consider the so-called bandit setting, where the feedback is significantly limited, and we provide an algorithm with small NE regret using one-point estimates of each payoff matrix.
△ Less
Submitted 17 July, 2019;
originally announced July 2019.
-
Last-iterate convergence rates for min-max optimization
Authors:
Jacob Abernethy,
Kevin A. Lai,
Andre Wibisono
Abstract:
While classic work in convex-concave min-max optimization relies on average-iterate convergence results, the emergence of nonconvex applications such as training Generative Adversarial Networks has led to renewed interest in last-iterate convergence guarantees. Proving last-iterate convergence is challenging because many natural algorithms, such as Simultaneous Gradient Descent/Ascent, provably di…
▽ More
While classic work in convex-concave min-max optimization relies on average-iterate convergence results, the emergence of nonconvex applications such as training Generative Adversarial Networks has led to renewed interest in last-iterate convergence guarantees. Proving last-iterate convergence is challenging because many natural algorithms, such as Simultaneous Gradient Descent/Ascent, provably diverge or cycle even in simple convex-concave min-max settings, and previous work on global last-iterate convergence rates has been limited to the bilinear and convex-strongly concave settings. In this work, we show that the Hamiltonian Gradient Descent (HGD) algorithm achieves linear convergence in a variety of more general settings, including convex-concave problems that satisfy a "sufficiently bilinear" condition. We also prove similar convergence rates for the Consensus Optimization (CO) algorithm of [MNG17] for some parameter settings of CO.
△ Less
Submitted 25 October, 2019; v1 submitted 5 June, 2019;
originally announced June 2019.
-
BrainNet: A Multi-Person Brain-to-Brain Interface for Direct Collaboration Between Brains
Authors:
Linxing Preston Jiang,
Andrea Stocco,
Darby M. Losey,
Justin A. Abernethy,
Chantel S. Prat,
Rajesh P. N. Rao
Abstract:
We present BrainNet which, to our knowledge, is the first multi-person non-invasive direct brain-to-brain interface for collaborative problem solving. The interface combines electroencephalography (EEG) to record brain signals and transcranial magnetic stimulation (TMS) to deliver information noninvasively to the brain. The interface allows three human subjects to collaborate and solve a task usin…
▽ More
We present BrainNet which, to our knowledge, is the first multi-person non-invasive direct brain-to-brain interface for collaborative problem solving. The interface combines electroencephalography (EEG) to record brain signals and transcranial magnetic stimulation (TMS) to deliver information noninvasively to the brain. The interface allows three human subjects to collaborate and solve a task using direct brain-to-brain communication. Two of the three subjects are "Senders" whose brain signals are decoded using real-time EEG data analysis to extract decisions about whether to rotate a block in a Tetris-like game before it is dropped to fill a line. The Senders' decisions are transmitted via the Internet to the brain of a third subject, the "Receiver," who cannot see the game screen. The decisions are delivered to the Receiver's brain via magnetic stimulation of the occipital cortex. The Receiver integrates the information received and makes a decision using an EEG interface about either turning the block or kee** it in the same position. A second round of the game gives the Senders one more chance to validate and provide feedback to the Receiver's action. We evaluated the performance of BrainNet in terms of (1) Group-level performance during the game; (2) True/False positive rates of subjects' decisions; (3) Mutual information between subjects. Five groups of three subjects successfully used BrainNet to perform the Tetris task, with an average accuracy of 0.813. Furthermore, by varying the information reliability of the Senders by artificially injecting noise into one Sender's signal, we found that Receivers are able to learn which Sender is more reliable based solely on the information transmitted to their brains. Our results raise the possibility of future brain-to-brain interfaces that enable cooperative problem solving by humans using a "social network" of connected brains.
△ Less
Submitted 22 May, 2019; v1 submitted 23 September, 2018;
originally announced September 2018.
-
Acceleration through Optimistic No-Regret Dynamics
Authors:
Jun-Kun Wang,
Jacob Abernethy
Abstract:
We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other and, after $T$ rounds, the average iterate is guaranteed to solve…
▽ More
We consider the problem of minimizing a smooth convex function by reducing the optimization to computing the Nash equilibrium of a particular zero-sum convex-concave game. Zero-sum games can be solved using online learning dynamics, where a classical technique involves simulating two no-regret algorithms that play against each other and, after $T$ rounds, the average iterate is guaranteed to solve the original optimization problem with error decaying as $O(\log T/T)$. In this paper we show that the technique can be enhanced to a rate of $O(1/T^2)$ by extending recent work \cite{RS13,SALS15} that leverages \textit{optimistic learning} to speed up equilibrium computation. The resulting optimization algorithm derived from this analysis coincides \textit{exactly} with the well-known \NA \cite{N83a} method, and indeed the same story allows us to recover several variants of the Nesterov's algorithm via small tweaks. We are also able to establish the accelerated linear rate for a function which is both strongly-convex and smooth. This methodology unifies a number of different iterative optimization methods: we show that the \HB algorithm is precisely the non-optimistic variant of \NA, and recent prior work already established a similar perspective on \FW \cite{AW17,ALLW18}.
△ Less
Submitted 15 November, 2018; v1 submitted 27 July, 2018;
originally announced July 2018.
-
ActiveRemediation: The Search for Lead Pipes in Flint, Michigan
Authors:
Jacob Abernethy,
Alex Chojnacki,
Arya Farahi,
Eric Schwartz,
Jared Webb
Abstract:
We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents' drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence…
▽ More
We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents' drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide.
△ Less
Submitted 17 August, 2018; v1 submitted 10 June, 2018;
originally announced June 2018.
-
Faster Rates for Convex-Concave Games
Authors:
Jacob Abernethy,
Kevin A. Lai,
Kfir Y. Levy,
Jun-Kun Wang
Abstract:
We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $O(T^{-1/2})$, recent work \citep{RS13,SALS15} has established $O(1/T)$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a parti…
▽ More
We consider the use of no-regret algorithms to compute equilibria for particular classes of convex-concave games. While standard regret bounds would lead to convergence rates on the order of $O(T^{-1/2})$, recent work \citep{RS13,SALS15} has established $O(1/T)$ rates by taking advantage of a particular class of optimistic prediction algorithms. In this work we go further, showing that for a particular class of games one achieves a $O(1/T^2)$ rate, and we show how this applies to the Frank-Wolfe method and recovers a similar bound \citep{D15}. We also show that such no-regret techniques can even achieve a linear rate, $O(\exp(-T))$, for equilibrium computation under additional curvature assumptions.
△ Less
Submitted 17 May, 2018;
originally announced May 2018.
-
Online Learning via the Differential Privacy Lens
Authors:
Jacob Abernethy,
Young Hun Jung,
Chansoo Lee,
Audra McMillan,
Ambuj Tewari
Abstract:
In this paper, we use differential privacy as a lens to examine online learning in both full and partial information settings. The differential privacy framework is, at heart, less about privacy and more about algorithmic stability, and thus has found application in domains well beyond those where information security is central. Here we develop an algorithmic property called one-step differential…
▽ More
In this paper, we use differential privacy as a lens to examine online learning in both full and partial information settings. The differential privacy framework is, at heart, less about privacy and more about algorithmic stability, and thus has found application in domains well beyond those where information security is central. Here we develop an algorithmic property called one-step differential stability which facilitates a more refined regret analysis for online learning methods. We show that tools from the differential privacy literature can yield regret bounds for many interesting online learning problems including online convex optimization and online linear optimization. Our stability notion is particularly well-suited for deriving first-order regret bounds for follow-the-perturbed-leader algorithms, something that all previous analyses have struggled to achieve. We also generalize the standard max-divergence to obtain a broader class called Tsallis max-divergences. These define stronger notions of stability that are useful in deriving bounds in partial information settings such as multi-armed bandits and bandits with experts.
△ Less
Submitted 28 October, 2019; v1 submitted 27 November, 2017;
originally announced November 2017.
-
A Data Science Approach to Understanding Residential Water Contamination in Flint
Authors:
Alex Chojnacki,
Chengyu Dai,
Arya Farahi,
Guangsha Shi,
Jared Webb,
Daniel T. Zhang,
Jacob Abernethy,
Eric Schwartz
Abstract:
When the residents of Flint learned that lead had contaminated their water system, the local government made water-testing kits available to them free of charge. The city government published the results of these tests, creating a valuable dataset that is key to understanding the causes and extent of the lead contamination event in Flint. This is the nation's largest dataset on lead in a municipal…
▽ More
When the residents of Flint learned that lead had contaminated their water system, the local government made water-testing kits available to them free of charge. The city government published the results of these tests, creating a valuable dataset that is key to understanding the causes and extent of the lead contamination event in Flint. This is the nation's largest dataset on lead in a municipal water system.
In this paper, we predict the lead contamination for each household's water supply, and we study several related aspects of Flint's water troubles, many of which generalize well beyond this one city. For example, we show that elevated lead risks can be (weakly) predicted from observable home attributes. Then we explore the factors associated with elevated lead. These risk assessments were developed in part via a crowd sourced prediction challenge at the University of Michigan. To inform Flint residents of these assessments, they have been incorporated into a web and mobile application funded by \texttt{Google.org}. We also explore questions of self-selection in the residential testing program, examining which factors are linked to when and how frequently residents voluntarily sample their water.
△ Less
Submitted 5 July, 2017;
originally announced July 2017.
-
On Convergence and Stability of GANs
Authors:
Naveen Kodali,
Jacob Abernethy,
James Hays,
Zsolt Kira
Abstract:
We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game…
▽ More
We propose studying GAN training dynamics as regret minimization, which is in contrast to the popular view that there is consistent minimization of a divergence between real and generated distributions. We analyze the convergence of GAN training from this new point of view to understand why mode collapse happens. We hypothesize the existence of undesirable local equilibria in this non-convex game to be responsible for mode collapse. We observe that these local equilibria often exhibit sharp gradients of the discriminator function around some real data points. We demonstrate that these degenerate local equilibria can be avoided with a gradient penalty scheme called DRAGAN. We show that DRAGAN enables faster training, achieves improved stability with fewer mode collapses, and leads to generator networks with better modeling performance across a variety of architectures and objective functions.
△ Less
Submitted 10 December, 2017; v1 submitted 19 May, 2017;
originally announced May 2017.
-
Data Science in Service of Performing Arts: Applying Machine Learning to Predicting Audience Preferences
Authors:
Jacob Abernethy,
Cyrus Anderson,
Alex Chojnacki,
Chengyu Dai,
John Dryden,
Eric Schwartz,
Wenbo Shen,
Jonathan Stroud,
Laura Wendlandt,
Sheng Yang,
Daniel Zhang
Abstract:
Performing arts organizations aim to enrich their communities through the arts. To do this, they strive to match their performance offerings to the taste of those communities. Success relies on understanding audience preference and predicting their behavior. Similar to most e-commerce or digital entertainment firms, arts presenters need to recommend the right performance to the right customer at t…
▽ More
Performing arts organizations aim to enrich their communities through the arts. To do this, they strive to match their performance offerings to the taste of those communities. Success relies on understanding audience preference and predicting their behavior. Similar to most e-commerce or digital entertainment firms, arts presenters need to recommend the right performance to the right customer at the right time. As part of the Michigan Data Science Team (MDST), we partnered with the University Musical Society (UMS), a non-profit performing arts presenter housed in the University of Michigan, Ann Arbor. We are providing UMS with analysis and business intelligence, utilizing historical individual-level sales data. We built a recommendation system based on collaborative filtering, gaining insights into the artistic preferences of customers, along with the similarities between performances. To better understand audience behavior, we used statistical methods from customer-base analysis. We characterized customer heterogeneity via segmentation, and we modeled customer cohorts to understand and predict ticket purchasing patterns. Finally, we combined statistical modeling with natural language processing (NLP) to explore the impact of wording in program descriptions. These ongoing efforts provide a platform to launch targeted marketing campaigns, hel** UMS carry out its mission by allocating its resources more efficiently. Celebrating its 138th season, UMS is a 2014 recipient of the National Medal of Arts, and it continues to enrich communities by connecting world-renowned artists with diverse audiences, especially students in their formative years. We aim to contribute to that mission through data science and customer analytics.
△ Less
Submitted 29 September, 2016;
originally announced November 2016.
-
Flint Water Crisis: Data-Driven Risk Assessment Via Residential Water Testing
Authors:
Jacob Abernethy,
Cyrus Anderson,
Chengyu Dai,
Arya Farahi,
Linh Nguyen,
Adam Rauh,
Eric Schwartz,
Wenbo Shen,
Guangsha Shi,
Jonathan Stroud,
Xinyu Tan,
Jared Webb,
Sheng Yang
Abstract:
Recovery from the Flint Water Crisis has been hindered by uncertainty in both the water testing process and the causes of contamination. In this work, we develop an ensemble of predictive models to assess the risk of lead contamination in individual homes and neighborhoods. To train these models, we utilize a wide range of data sources, including voluntary residential water tests, historical recor…
▽ More
Recovery from the Flint Water Crisis has been hindered by uncertainty in both the water testing process and the causes of contamination. In this work, we develop an ensemble of predictive models to assess the risk of lead contamination in individual homes and neighborhoods. To train these models, we utilize a wide range of data sources, including voluntary residential water tests, historical records, and city infrastructure data. Additionally, we use our models to identify the most prominent factors that contribute to a high risk of lead contamination. In this analysis, we find that lead service lines are not the only factor that is predictive of the risk of lead contamination of water. These results could be used to guide the long-term recovery efforts in Flint, minimize the immediate damages, and improve resource-allocation decisions for similar water infrastructure crises.
△ Less
Submitted 30 September, 2016;
originally announced October 2016.
-
A Descending Price Auction for Matching Markets
Authors:
Shih-Tang Su,
Jacob D. Abernethy,
Grant Schoenebeck,
Vijay G. Subramanian
Abstract:
This work presents a descending-price-auction algorithm to obtain the maximum market-clearing price vector (MCP) in unit-demand matching markets with m items by exploiting the combinatorial structure. With a shrewd choice of goods for which the prices are reduced in each step, the algorithm only uses the combinatorial structure, which avoids solving LPs and enjoys a strongly polynomial runtime of…
▽ More
This work presents a descending-price-auction algorithm to obtain the maximum market-clearing price vector (MCP) in unit-demand matching markets with m items by exploiting the combinatorial structure. With a shrewd choice of goods for which the prices are reduced in each step, the algorithm only uses the combinatorial structure, which avoids solving LPs and enjoys a strongly polynomial runtime of $O(m^4)$. Critical to the algorithm is determining the set of under-demanded goods for which we reduce the prices simultaneously in each step of the algorithm. This we accomplish by choosing the subset of goods that maximize a skewness function, which makes the bipartite graph series converges to the combinatorial structure at the maximum MCP in $O(m^2)$ steps. A graph coloring algorithm is proposed to find the set of goods with the maximal skewness value that yields $O(m^4)$ complexity.
△ Less
Submitted 4 November, 2017; v1 submitted 16 July, 2016;
originally announced July 2016.
-
Fighting Bandits with a New Kind of Smoothness
Authors:
Jacob Abernethy,
Chansoo Lee,
Ambuj Tewari
Abstract:
We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the \emph{Tsallis entropy}, which includes EXP3 as a special case, achieves the $Θ(\sqrt{TN})$ minimax regret. Second, we show that a wide class of perturbation methods achieve a n…
▽ More
We define a novel family of algorithms for the adversarial multi-armed bandit problem, and provide a simple analysis technique based on convex smoothing. We prove two main results. First, we show that regularization via the \emph{Tsallis entropy}, which includes EXP3 as a special case, achieves the $Θ(\sqrt{TN})$ minimax regret. Second, we show that a wide class of perturbation methods achieve a near-optimal regret as low as $O(\sqrt{TN \log N})$ if the perturbation distribution has a bounded hazard rate. For example, the Gumbel, Weibull, Frechet, Pareto, and Gamma distributions all satisfy this key property.
△ Less
Submitted 13 December, 2015;
originally announced December 2015.
-
Rate of Price Discovery in Iterative Combinatorial Auctions
Authors:
Jacob Abernethy,
Sébastien Lahaie,
Matus Telgarsky
Abstract:
We study a class of iterative combinatorial auctions which can be viewed as subgradient descent methods for the problem of pricing bundles to balance supply and demand. We provide concrete convergence rates for auctions in this class, bounding the number of auction rounds needed to reach clearing prices. Our analysis allows for a variety of pricing schemes, including item, bundle, and polynomial p…
▽ More
We study a class of iterative combinatorial auctions which can be viewed as subgradient descent methods for the problem of pricing bundles to balance supply and demand. We provide concrete convergence rates for auctions in this class, bounding the number of auction rounds needed to reach clearing prices. Our analysis allows for a variety of pricing schemes, including item, bundle, and polynomial pricing, and the respective convergence rates confirm that more expressive pricing schemes come at the cost of slower convergence. We consider two models of bidder behavior. In the first model, bidders behave stochastically according to a random utility model, which includes standard best-response bidding as a special case. In the second model, bidders behave arbitrarily (even adversarially), and meaningful convergence relies on properly designed activity rules.
△ Less
Submitted 31 May, 2016; v1 submitted 18 November, 2015;
originally announced November 2015.
-
Spectral Smoothing via Random Matrix Perturbations
Authors:
Jacob Abernethy,
Chansoo Lee,
Ambuj Tewari
Abstract:
We consider stochastic smoothing of spectral functions of matrices using perturbations commonly studied in random matrix theory. We show that a spectral function remains spectral when smoothed using a unitarily invariant perturbation distribution. We then derive state-of-the-art smoothing bounds for the maximum eigenvalue function using the Gaussian Orthogonal Ensemble (GOE). Smoothing the maximum…
▽ More
We consider stochastic smoothing of spectral functions of matrices using perturbations commonly studied in random matrix theory. We show that a spectral function remains spectral when smoothed using a unitarily invariant perturbation distribution. We then derive state-of-the-art smoothing bounds for the maximum eigenvalue function using the Gaussian Orthogonal Ensemble (GOE). Smoothing the maximum eigenvalue function is important for applications in semidefinite optimization and online learning. As a direct consequence of our GOE smoothing results, we obtain an $O((N \log N)^{1/4} \sqrt{T})$ expected regret bound for the online variance minimization problem using an algorithm that performs only a single maximum eigenvector computation per time step. Here $T$ is the number of rounds and $N$ is the matrix dimension. Our algorithm and its analysis also extend to the more general online PCA problem where the learner has to output a rank $k$ subspace. The algorithm just requires computing $k$ maximum eigenvectors per step and enjoys an $O(k (N \log N)^{1/4} \sqrt{T})$ expected regret bound.
△ Less
Submitted 13 December, 2015; v1 submitted 10 July, 2015;
originally announced July 2015.
-
Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier
Authors:
Jacob Abernethy,
Elad Hazan
Abstract:
This paper explores a surprising equivalence between two seemingly-distinct convex optimization methods. We show that simulated annealing, a well-studied random walk algorithms, is directly equivalent, in a certain sense, to the central path interior point algorithm for the the entropic universal barrier function. This connection exhibits several benefits. First, we are able improve the state of t…
▽ More
This paper explores a surprising equivalence between two seemingly-distinct convex optimization methods. We show that simulated annealing, a well-studied random walk algorithms, is directly equivalent, in a certain sense, to the central path interior point algorithm for the the entropic universal barrier function. This connection exhibits several benefits. First, we are able improve the state of the art time complexity for convex optimization under the membership oracle model. We improve the analysis of the randomized algorithm of Kalai and Vempala by utilizing tools developed by Nesterov and Nemirovskii that underly the central path following interior point algorithm. We are able to tighten the temperature schedule for simulated annealing which gives an improved running time, reducing by square root of the dimension in certain instances. Second, we get an efficient randomized interior point method with an efficiently computable universal barrier for any convex set described by a membership oracle. Previously, efficiently computable barriers were known only for particular convex sets.
△ Less
Submitted 5 November, 2015; v1 submitted 9 July, 2015;
originally announced July 2015.
-
Low-Cost Learning via Active Data Procurement
Authors:
Jacob Abernethy,
Yiling Chen,
Chien-Ju Ho,
Bo Waggoner
Abstract:
We design mechanisms for online procurement of data held by strategic agents for machine learning tasks. The challenge is to use past data to actively price future data and give learning guarantees even when an agent's cost for revealing her data may depend arbitrarily on the data itself. We achieve this goal by showing how to convert a large class of no-regret algorithms into online posted-price…
▽ More
We design mechanisms for online procurement of data held by strategic agents for machine learning tasks. The challenge is to use past data to actively price future data and give learning guarantees even when an agent's cost for revealing her data may depend arbitrarily on the data itself. We achieve this goal by showing how to convert a large class of no-regret algorithms into online posted-price and learning mechanisms. Our results in a sense parallel classic sample complexity guarantees, but with the key resource being money rather than quantity of data: With a budget constraint $B$, we give robust risk (predictive error) bounds on the order of $1/\sqrt{B}$. Because we use an active approach, we can often guarantee to do significantly better by leveraging correlations between costs and data.
Our algorithms and analysis go through a model of no-regret learning with $T$ arriving pairs (cost, data) and a budget constraint of $B$. Our regret bounds for this model are on the order of $T/\sqrt{B}$ and we give lower bounds on the same order.
△ Less
Submitted 5 June, 2015; v1 submitted 20 February, 2015;
originally announced February 2015.
-
Online Linear Optimization via Smoothing
Authors:
Jacob Abernethy,
Chansoo Lee,
Abhinav Sinha,
Ambuj Tewari
Abstract:
We present a new optimization-theoretic approach to analyzing Follow-the-Leader style algorithms, particularly in the setting where perturbations are used as a tool for regularization. We show that adding a strongly convex penalty function to the decision rule and adding stochastic perturbations to data correspond to deterministic and stochastic smoothing operations, respectively. We establish an…
▽ More
We present a new optimization-theoretic approach to analyzing Follow-the-Leader style algorithms, particularly in the setting where perturbations are used as a tool for regularization. We show that adding a strongly convex penalty function to the decision rule and adding stochastic perturbations to data correspond to deterministic and stochastic smoothing operations, respectively. We establish an equivalence between "Follow the Regularized Leader" and "Follow the Perturbed Leader" up to the smoothness properties. This intuition leads to a new generic analysis framework that recovers and improves the previous known regret bounds of the class of algorithms commonly known as Follow the Perturbed Leader.
△ Less
Submitted 23 May, 2014;
originally announced May 2014.
-
Information Aggregation in Exponential Family Markets
Authors:
Jacob Abernethy,
Sindhu Kutty,
Sébastien Lahaie,
Rahul Sami
Abstract:
We consider the design of prediction market mechanisms known as automated market makers. We show that we can design these mechanisms via the mold of \emph{exponential family distributions}, a popular and well-studied probability distribution template used in statistics. We give a full development of this relationship and explore a range of benefits. We draw connections between the information aggr…
▽ More
We consider the design of prediction market mechanisms known as automated market makers. We show that we can design these mechanisms via the mold of \emph{exponential family distributions}, a popular and well-studied probability distribution template used in statistics. We give a full development of this relationship and explore a range of benefits. We draw connections between the information aggregation of market prices and the belief aggregation of learning agents that rely on exponential family distributions. We develop a very natural analysis of the market behavior as well as the price equilibrium under the assumption that the traders exhibit risk aversion according to exponential utility. We also consider similar aspects under alternative models, such as when traders are budget constrained.
△ Less
Submitted 21 February, 2014;
originally announced February 2014.
-
Minimax Option Pricing Meets Black-Scholes in the Limit
Authors:
Jacob Abernethy,
Rafael M. Frongillo,
Andre Wibisono
Abstract:
Option contracts are a type of financial derivative that allow investors to hedge risk and speculate on the variation of an asset's future market price. In short, an option has a particular payout that is based on the market price for an asset on a given date in the future. In 1973, Black and Scholes proposed a valuation model for options that essentially estimates the tail risk of the asset price…
▽ More
Option contracts are a type of financial derivative that allow investors to hedge risk and speculate on the variation of an asset's future market price. In short, an option has a particular payout that is based on the market price for an asset on a given date in the future. In 1973, Black and Scholes proposed a valuation model for options that essentially estimates the tail risk of the asset price under the assumption that the price will fluctuate according to geometric Brownian motion. More recently, DeMarzo et al., among others, have proposed more robust valuation schemes, where we can even assume an adversary chooses the price fluctuations. This framework can be considered as a sequential two-player zero-sum game between the investor and Nature. We analyze the value of this game in the limit, where the investor can trade at smaller and smaller time intervals. Under weak assumptions on the actions of Nature (an adversary), we show that the minimax option price asymptotically approaches exactly the Black-Scholes valuation. The key piece of our analysis is showing that Nature's minimax optimal dual strategy converges to geometric Brownian motion in the limit.
△ Less
Submitted 12 February, 2012;
originally announced February 2012.
-
A Collaborative Mechanism for Crowdsourcing Prediction Problems
Authors:
Jacob Abernethy,
Rafael M. Frongillo
Abstract:
Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of "crowdsourcing" prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively "learn" a hypothesis for a g…
▽ More
Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of "crowdsourcing" prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively "learn" a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.
△ Less
Submitted 11 November, 2011;
originally announced November 2011.
-
An Optimization-Based Framework for Automated Market-Making
Authors:
Jacob Abernethy,
Yiling Chen,
Jennifer Wortman Vaughan
Abstract:
Building on ideas from online convex optimization, we propose a general framework for the design of efficient securities markets over very large outcome spaces. The challenge here is computational. In a complete market, in which one security is offered for each outcome, the market institution can not efficiently keep track of the transaction history or calculate security prices when the outcome sp…
▽ More
Building on ideas from online convex optimization, we propose a general framework for the design of efficient securities markets over very large outcome spaces. The challenge here is computational. In a complete market, in which one security is offered for each outcome, the market institution can not efficiently keep track of the transaction history or calculate security prices when the outcome space is large. The natural solution is to restrict the space of securities to be much smaller than the outcome space in such a way that securities can be priced efficiently. Recent research has focused on searching for spaces of securities that can be priced efficiently by existing mechanisms designed for complete markets. While there have been some successes, much of this research has led to hardness results. In this paper, we take a drastically different approach. We start with an arbitrary space of securities with bounded payoff, and establish a framework to design markets tailored to this space. We prove that any market satisfying a set of intuitive conditions must price securities via a convex potential function and that the space of reachable prices must be precisely the convex hull of the security payoffs. We then show how the convex potential function can be defined in terms of an optimization over the convex hull of the security payoffs. The optimal solution to the optimization problem gives the security prices. Using this framework, we provide an efficient market for predicting the landing location of an object on a sphere. In addition, we show that we can relax our "no-arbitrage" condition to design a new efficient market maker for pair betting, which is known to be #P-hard to price using existing mechanisms. This relaxation also allows the market maker to charge transaction fees so that the depth of the market can be dynamically increased as the number of trades increases.
△ Less
Submitted 8 November, 2010;
originally announced November 2010.
-
Blackwell Approachability and Low-Regret Learning are Equivalent
Authors:
Jacob Abernethy,
Peter L. Bartlett,
Elad Hazan
Abstract:
We consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. We show that Blackwell's result is equivalent, via efficient reductions, to the existence of "no-regret" algorithms for Online Linear Optimization. Indeed, we show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide a useful application…
▽ More
We consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. We show that Blackwell's result is equivalent, via efficient reductions, to the existence of "no-regret" algorithms for Online Linear Optimization. Indeed, we show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide a useful application of this reduction: the first efficient algorithm for calibrated forecasting.
△ Less
Submitted 8 November, 2010;
originally announced November 2010.
-
A Stochastic View of Optimal Regret through Minimax Duality
Authors:
Jacob Abernethy,
Alekh Agarwal,
Peter L. Bartlett,
Alexander Rakhlin
Abstract:
We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference betwee…
▽ More
We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.
△ Less
Submitted 30 March, 2009;
originally announced March 2009.
-
A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
Authors:
Jacob Abernethy,
Francis Bach,
Theodoros Evgeniou,
Jean-Philippe Vert
Abstract:
We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators from "users" to the "objects" they rate. Recent low-rank type matrix completion approaches to CF are shown to be special cases. However, unlike existing regularization based CF methods, our approach can be used to also incorporate information such as attributes of the users or t…
▽ More
We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators from "users" to the "objects" they rate. Recent low-rank type matrix completion approaches to CF are shown to be special cases. However, unlike existing regularization based CF methods, our approach can be used to also incorporate information such as attributes of the users or the objects -- a limitation of existing regularization based CF methods. We then provide novel representer theorems that we use to develop new estimation methods. We provide learning algorithms based on low-rank decompositions, and test them on a standard CF dataset. The experiments indicate the advantages of generalizing the existing regularization based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can be also seen as special cases of our proposed approach.
△ Less
Submitted 19 December, 2008; v1 submitted 11 February, 2008;
originally announced February 2008.
-
Low-rank matrix factorization with attributes
Authors:
Jacob Abernethy,
Francis Bach,
Theodoros Evgeniou,
Jean-Philippe Vert
Abstract:
We develop a new collaborative filtering (CF) method that combines both previously known users' preferences, i.e. standard CF, as well as product/user attributes, i.e. classical function approximation, to predict a given user's interest in a particular product. Our method is a generalized low rank matrix completion problem, where we learn a function whose inputs are pairs of vectors -- the stand…
▽ More
We develop a new collaborative filtering (CF) method that combines both previously known users' preferences, i.e. standard CF, as well as product/user attributes, i.e. classical function approximation, to predict a given user's interest in a particular product. Our method is a generalized low rank matrix completion problem, where we learn a function whose inputs are pairs of vectors -- the standard low rank matrix completion problem being a special case where the inputs to the function are the row and column indices of the matrix. We solve this generalized matrix completion problem using tensor product kernels for which we also formally generalize standard kernel properties. Benchmark experiments on movie ratings show the advantages of our generalized matrix completion method over the standard matrix completion one with no information about movies or people, as well as over standard multi-task or single task learning methods.
△ Less
Submitted 24 November, 2006;
originally announced November 2006.