Search | arXiv e-print repository

Improved bounds for calibration via stronger sign preservation games

Authors: Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich, Robert Kleinberg, Princewill Okoroafor

Abstract: A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration… ▽ More A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with $O(T^{2/3})$ calibration error after $T$ time steps, and showed a lower bound of $Ω(T^{1/2})$. These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to $Ω(T^{0.528})$ by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration. We introduce a strengthening of Qiao & Valiant's game that we call sign preservation with reuse (SPR). We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. In particular, any strategy that improves the trivial upper bound for the value of the SPR game would imply a forecasting algorithm with calibration error exponent less than 2/3, improving Foster & Vohra's upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely $Ω(T^{0.54389})$. Our lower bound is obtained by an oblivious adversary, marking the first $ω(T^{1/2})$ calibration lower bound for oblivious adversaries. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13116 [pdf, ps, other]

A Lower Bound on Swap Regret in Extensive-Form Games

Authors: Constantinos Daskalakis, Gabriele Farina, Noah Golowich, Tuomas Sandholm, Brian Hu Zhang

Abstract: Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $ε$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/ε)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/ε)$-round a… ▽ More Recent simultaneous works by Peng and Rubinstein [2024] and Dagan et al. [2024] have demonstrated the existence of a no-swap-regret learning algorithm that can reach $ε$ average swap regret against an adversary in any extensive-form game within $m^{\tilde{\mathcal O}(1/ε)}$ rounds, where $m$ is the number of nodes in the game tree. However, the question of whether a $\mathrm{poly}(m, 1/ε)$-round algorithm could exist remained open. In this paper, we show a lower bound that precludes the existence of such an algorithm. In particular, we show that achieving average swap regret $ε$ against an oblivious adversary in general extensive-form games requires at least $\mathrm{exp}\left(Ω\left(\min\left\{m^{1/14}, ε^{-1/6}\right\}\right)\right)$ rounds. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11686 [pdf, ps, other]

The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation

Authors: Noah Golowich, Ankur Moitra

Abstract: In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a com… ▽ More In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error. △ Less

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: RLC 2024

arXiv:2406.11667 [pdf, ps, other]

Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'?

Authors: Constantinos Daskalakis, Noah Golowich

Abstract: The empirical risk minimization (ERM) principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a gi… ▽ More The empirical risk minimization (ERM) principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning: in particular, is there a weaker oracle than ERM which can nevertheless enable learnability? We answer this question affirmatively, showing that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle which only returns a single bit indicating whether a given dataset is realizable by some concept in the class. The sample complexity and oracle complexity of our algorithm depend polynomially on the VC dimension of the hypothesis class, thus showing that there is only a polynomial price to pay for use of our weaker oracle. Our results extend to the agnostic learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings. In the setting of partial concept classes, prior to our work no oracle-efficient algorithms were known, even with a standard ERM oracle. Thus, our results address a question of Alon et al. (2021) who asked whether there are algorithmic principles which enable efficient learnability in this setting. △ Less

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: COLT 2024

arXiv:2406.11640 [pdf, ps, other]

Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions

Authors: Noah Golowich, Ankur Moitra

Abstract: One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We… ▽ More One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism which requires solving a nonconvex optimization problem. In particular, it has remained open as to whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant. △ Less

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: COLT 2024

arXiv:2406.07920 [pdf, ps, other]

Near-Optimal Learning and Planning in Separated Latent MDPs

Authors: Fan Chen, Constantinos Daskalakis, Noah Golowich, Alexander Rakhlin

Abstract: We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of separation of the constituent MDPs. The main thrust of this paper is in establishing a nearly-sharp *statist… ▽ More We study computational and statistical aspects of learning Latent Markov Decision Processes (LMDPs). In this model, the learner interacts with an MDP drawn at the beginning of each epoch from an unknown mixture of MDPs. To sidestep known impossibility results, we consider several notions of separation of the constituent MDPs. The main thrust of this paper is in establishing a nearly-sharp *statistical threshold* for the horizon length necessary for efficient learning. On the computational side, we show that under a weaker assumption of separability under the optimal policy, there is a quasi-polynomial algorithm with time complexity scaling in terms of the statistical threshold. We further show a near-matching time complexity lower bound under the exponential time hypothesis. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: COLT 2024

arXiv:2406.02633 [pdf, ps, other]

Edit Distance Robust Watermarks for Language Models

Authors: Noah Golowich, Ankur Moitra

Abstract: Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn & Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual o… ▽ More Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn & Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both undetectability and robustness to edits when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our key idea is to handle adversarial insertions and deletions by interpreting the symbols as indices into the codeword, which we call indexing pseudorandom codes. Additionally, our codes rely on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01799 [pdf, other]

Online Control in Population Dynamics

Authors: Noah Golowich, Elad Hazan, Zhou Lu, Dhruv Rohatgi, Y. Jennifer Sun

Abstract: The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for control in population dynamics are often restricted to specific, noise-free dynamics,… ▽ More The study of population dynamics originated with early sociological works but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for control in population dynamics are often restricted to specific, noise-free dynamics, while real-world population changes can be complex and adversarial. To address this gap, we propose a new framework based on the paradigm of online control. We first characterize a set of linear dynamical systems that can naturally model evolving populations. We then give an efficient gradient-based controller for these systems, with near-optimal regret bounds with respect to a broad class of linear policies. Our empirical evaluations demonstrate the effectiveness of the proposed algorithm for control in population dynamics even for non-linear models such as SIR and replicator dynamics. △ Less

Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

arXiv:2404.11325 [pdf, ps, other]

On Learning Parities with Dependent Noise

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: In this expository note we show that the learning parities with noise (LPN) assumption is robust to weak dependencies in the noise distribution of small batches of samples. This provides a partial converse to the linearization technique of [AG11]. The material in this note is drawn from a recent work by the authors [GMR24], where the robustness guarantee was a key component in a cryptographic sepa… ▽ More In this expository note we show that the learning parities with noise (LPN) assumption is robust to weak dependencies in the noise distribution of small batches of samples. This provides a partial converse to the linearization technique of [AG11]. The material in this note is drawn from a recent work by the authors [GMR24], where the robustness guarantee was a key component in a cryptographic separation between reinforcement learning and supervised learning. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: This note draws heavily from arXiv:2404.03774

arXiv:2404.03774 [pdf, other]

Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free explorati… ▽ More Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck. △ Less

Submitted 4 April, 2024; originally announced April 2024.

Comments: 112 pages, 3 figures

arXiv:2310.19786 [pdf, ps, other]

From External to Swap Regret 2.0: An Efficient Reduction and Oblivious Adversary for Large Action Spaces

Authors: Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich

Abstract: We provide a novel reduction from swap-regret minimization to external-regret minimization, which improves upon the classical reductions of Blum-Mansour [BM07] and Stolz-Lugosi [SL05] in that it does not require finiteness of the space of actions. We show that, whenever there exists a no-external-regret algorithm for some hypothesis class, there must also exist a no-swap-regret algorithm for that… ▽ More We provide a novel reduction from swap-regret minimization to external-regret minimization, which improves upon the classical reductions of Blum-Mansour [BM07] and Stolz-Lugosi [SL05] in that it does not require finiteness of the space of actions. We show that, whenever there exists a no-external-regret algorithm for some hypothesis class, there must also exist a no-swap-regret algorithm for that same class. For the problem of learning with expert advice, our result implies that it is possible to guarantee that the swap regret is bounded by ε after $\log(N)^{O(1/ε)}$ rounds and with $O(N)$ per iteration complexity, where $N$ is the number of experts, while the classical reductions of Blum-Mansour and Stolz-Lugosi require $O(N/ε^2)$ rounds and at least $Ω(N^2)$ per iteration complexity. Our result comes with an associated lower bound, which -- in contrast to that in [BM07] -- holds for oblivious and $\ell_1$-constrained adversaries and learners that can employ distributions over experts, showing that the number of rounds must be $\tildeΩ(N/ε^2)$ or exponential in $1/ε$. Our reduction implies that, if no-regret learning is possible in some game, then this game must have approximate correlated equilibria, of arbitrarily good approximation. This strengthens the folklore implication of no-regret learning that approximate coarse correlated equilibria exist. Importantly, it provides a sufficient condition for the existence of correlated equilibrium which vastly extends the requirement that the action set is finite, thus answering a question left open by [DG22; Ass+23]. Moreover, it answers several outstanding questions about equilibrium computation and learning in games. △ Less

Submitted 6 December, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

arXiv:2309.12226 [pdf, ps, other]

Smooth Nash Equilibria: Algorithms and Complexity

Authors: Constantinos Daskalakis, Noah Golowich, Nika Haghtalab, Abhishek Shetty

Abstract: A fundamental shortcoming of the concept of Nash equilibrium is its computational intractability: approximating Nash equilibria in normal-form games is PPAD-hard. In this paper, inspired by the ideas of smoothed analysis, we introduce a relaxed variant of Nash equilibrium called $σ$-smooth Nash equilibrium, for a smoothness parameter $σ$. In a $σ$-smooth Nash equilibrium, players only need to achi… ▽ More A fundamental shortcoming of the concept of Nash equilibrium is its computational intractability: approximating Nash equilibria in normal-form games is PPAD-hard. In this paper, inspired by the ideas of smoothed analysis, we introduce a relaxed variant of Nash equilibrium called $σ$-smooth Nash equilibrium, for a smoothness parameter $σ$. In a $σ$-smooth Nash equilibrium, players only need to achieve utility at least as high as their best deviation to a $σ$-smooth strategy, which is a distribution that does not put too much mass (as parametrized by $σ$) on any fixed action. We distinguish two variants of $σ$-smooth Nash equilibria: strong $σ$-smooth Nash equilibria, in which players are required to play $σ$-smooth strategies under equilibrium play, and weak $σ$-smooth Nash equilibria, where there is no such requirement. We show that both weak and strong $σ$-smooth Nash equilibria have superior computational properties to Nash equilibria: when $σ$ as well as an approximation parameter $ε$ and the number of players are all constants, there is a constant-time randomized algorithm to find a weak $ε$-approximate $σ$-smooth Nash equilibrium in normal-form games. In the same parameter regime, there is a polynomial-time deterministic algorithm to find a strong $ε$-approximate $σ$-smooth Nash equilibrium in a normal-form game. These results stand in contrast to the optimal algorithm for computing $ε$-approximate Nash equilibria, which cannot run in faster than quasipolynomial-time. We complement our upper bounds by showing that when either $σ$ or $ε$ is an inverse polynomial, finding a weak $ε$-approximate $σ$-smooth Nash equilibria becomes computationally intractable. △ Less

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.09457 [pdf, ps, other]

Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $φ(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen s… ▽ More The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map $φ(x, a)$ that maps state-action pairs to $d$-dimensional vectors, and that the rewards and transitions are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the ``kitchen sink" approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a $k$-sparse linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing all the relevant features, and the goal is to learn a near-optimal policy in only poly$(k,\log d)$ interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist and can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples. This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a natural model where improving the sample complexity via representation learning is computationally feasible. △ Less

Submitted 18 September, 2023; v1 submitted 17 September, 2023; originally announced September 2023.

arXiv:2305.00684 [pdf, other]

On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring

Authors: Dylan J. Foster, Dean P. Foster, Noah Golowich, Alexander Rakhlin

Abstract: A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with fun… ▽ More A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees, and how these considerations change as we move from few to many agents. We study this question in a general framework for interactive decision making with multiple agents, encompassing Markov games with function approximation and normal-form games with bandit feedback. We focus on equilibrium computation, in which a centralized learning algorithm aims to compute an equilibrium by controlling multiple agents that interact with an unknown environment. Our main contributions are: - We provide upper and lower bounds on the optimal sample complexity for multi-agent decision making based on a multi-agent generalization of the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. (2021) in the single-agent counterpart to our setting. Compared to the best results for the single-agent setting, our bounds have additional gaps. We show that no "reasonable" complexity measure can close these gaps, highlighting a striking separation between single and multiple agents. - We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making, but with hidden (unobserved) rewards, a framework that subsumes variants of the partial monitoring problem. As a consequence, we characterize the statistical complexity for hidden-reward interactive decision making to the best extent possible. Building on this development, we provide several new structural results, including 1) conditions under which the statistical complexity of multi-agent decision making can be reduced to that of single-agent, and 2) conditions under which the so-called curse of multiple agents can be avoided. △ Less

Submitted 1 May, 2023; originally announced May 2023.

Comments: 95 pages

arXiv:2303.12287 [pdf, ps, other]

Hardness of Independent Learning and Sparse Equilibrium Computation in Markov Games

Authors: Dylan J. Foster, Noah Golowich, Sham M. Kakade

Abstract: We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restric… ▽ More We consider the problem of decentralized multi-agent reinforcement learning in Markov games. A fundamental question is whether there exist algorithms that, when adopted by all agents and run independently in a decentralized fashion, lead to no-regret for each player, analogous to celebrated convergence results in normal-form games. While recent work has shown that such algorithms exist for restricted settings (notably, when regret is defined with respect to deviations to Markovian policies), the question of whether independent no-regret learning can be achieved in the standard Markov game framework was open. We provide a decisive negative resolution this problem, both from a computational and statistical perspective. We show that: - Under the widely-believed assumption that PPAD-hard problems cannot be solved in polynomial time, there is no polynomial-time algorithm that attains no-regret in general-sum Markov games when executed independently by all players, even when the game is known to the algorithm designer and the number of players is a small constant. - When the game is unknown, no algorithm, regardless of computational efficiency, can achieve no-regret without observing a number of episodes that is exponential in the number of players. Perhaps surprisingly, our lower bounds hold even for seemingly easier setting in which all agents are controlled by a a centralized algorithm. They are proven via lower bounds for a simpler problem we refer to as SparseCCE, in which the goal is to compute a coarse correlated equilibrium that is sparse in the sense that it can be represented as a mixture of a small number of product policies. The crux of our approach is a novel application of aggregation techniques from online learning, whereby we show that any algorithm for the SparseCCE problem can be used to compute approximate Nash equilibria for non-zero sum normal-form games. △ Less

Submitted 21 March, 2023; originally announced March 2023.

Comments: 51 pages

arXiv:2301.08215 [pdf, other]

Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient

Authors: Dylan J. Foster, Noah Golowich, Yanjun Han

Abstract: A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds… ▽ More A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use. △ Less

Submitted 19 January, 2023; originally announced January 2023.

arXiv:2211.14250 [pdf, other]

Model-Free Reinforcement Learning with the Decision-Estimation Coefficient

Authors: Dylan J. Foster, Noah Golowich, Jian Qian, Alexander Rakhlin, Ayush Sekhari

Abstract: We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which ach… ▽ More We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction, which lifts algorithms for (supervised) online estimation into algorithms for decision making. In this paper, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation, and give structural results showing when it can and cannot help more generally. △ Less

Submitted 12 August, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: V2 changes: Improved writing and added more examples

arXiv:2210.09769 [pdf, other]

STay-ON-the-Ridge: Guaranteed Convergence to Local Minimax Equilibrium in Nonconvex-Nonconcave Games

Authors: Constantinos Daskalakis, Noah Golowich, Stratis Skoulakis, Manolis Zampetakis

Abstract: Min-max optimization problems involving nonconvex-nonconcave objectives have found important applications in adversarial training and other multi-agent learning settings. Yet, no known gradient descent-based method is guaranteed to converge to (even local notions of) min-max equilibrium in the nonconvex-nonconcave setting. For all known methods, there exist relatively simple objectives for which t… ▽ More Min-max optimization problems involving nonconvex-nonconcave objectives have found important applications in adversarial training and other multi-agent learning settings. Yet, no known gradient descent-based method is guaranteed to converge to (even local notions of) min-max equilibrium in the nonconvex-nonconcave setting. For all known methods, there exist relatively simple objectives for which they cycle or exhibit other undesirable behavior different from converging to a point, let alone to some game-theoretically meaningful one~\cite{flokas2019poincare,hsieh2021limits}. The only known convergence guarantees hold under the strong assumption that the initialization is very close to a local min-max equilibrium~\cite{wang2019solving}. Moreover, the afore-described challenges are not just theoretical curiosities. All known methods are unstable in practice, even in simple settings. We propose the first method that is guaranteed to converge to a local min-max equilibrium for smooth nonconvex-nonconcave objectives. Our method is second-order and provably escapes limit cycles as long as it is initialized at an easy-to-find initial point. Both the definition of our method and its convergence analysis are motivated by the topological nature of the problem. In particular, our method is not designed to decrease some potential function, such as the distance of its iterate from the set of local min-max equilibria or the projected gradient of the objective, but is designed to satisfy a topological property that guarantees the avoidance of cycles and implies its convergence. △ Less

Submitted 18 October, 2022; originally announced October 2022.

arXiv:2206.03446 [pdf, ps, other]

Learning in Observable POMDPs, without Computationally Intractable Oracles

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planni… ▽ More Much of reinforcement learning theory is built on top of oracles that are computationally hard to implement. Specifically for learning near-optimal policies in Partially Observable Markov Decision Processes (POMDPs), existing algorithms either need to make strong assumptions about the model dynamics (e.g. deterministic transitions) or assume access to an oracle for solving a hard optimistic planning or estimation problem as a subroutine. In this work we develop the first oracle-free learning algorithm for POMDPs under reasonable assumptions. Specifically, we give a quasipolynomial-time end-to-end algorithm for learning in "observable" POMDPs, where observability is the assumption that well-separated distributions over states induce well-separated distributions over observations. Our techniques circumvent the more traditional approach of using the principle of optimism under uncertainty to promote exploration, and instead give a novel application of barycentric spanners to constructing policy covers. △ Less

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2204.03991 [pdf, ps, other]

The Complexity of Markov Equilibrium in Stochastic Games

Authors: Constantinos Daskalakis, Noah Golowich, Kaiqing Zhang

Abstract: We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are effici… ▽ More We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are efficiently computable. A fortiori, our results imply that there are no efficient algorithms for learning stationary Markov CCE policies in multi-agent reinforcement learning (MARL), even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: 50 pages

arXiv:2202.04690 [pdf, ps, other]

Smoothed Online Learning is as Easy as Statistical Learning

Authors: Adam Block, Yuval Dagan, Noah Golowich, Alexander Rakhlin

Abstract: Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the sm… ▽ More Much of modern learning theory has been split between two regimes: the classical offline setting, where data arrive independently, and the online setting, where data arrive adversarially. While the former model is often both computationally and statistically tractable, the latter requires no distributional assumptions. In an attempt to achieve the best of both worlds, previous work proposed the smooth online setting where each sample is drawn from an adversarially chosen distribution, which is smooth, i.e., it has a bounded density with respect to a fixed dominating measure. We provide tight bounds on the minimax regret of learning a nonparametric function class, with nearly optimal dependence on both the horizon and smoothness parameters. Furthermore, we provide the first oracle-efficient, no-regret algorithms in this setting. In particular, we propose an oracle-efficient improper algorithm whose regret achieves optimal dependence on the horizon and a proper algorithm requiring only a single oracle call per round whose regret has the optimal horizon dependence in the classification setting and is sublinear in general. Both algorithms have exponentially worse dependence on the smoothness parameter of the adversary than the minimax rate. We then prove a lower bound on the oracle complexity of any proper learning algorithm, which matches the oracle-efficient upper bounds up to a polynomial factor, thus demonstrating the existence of a statistical-computational gap in smooth online learning. Finally, we apply our results to the contextual bandit setting to show that if a function class is learnable in the classical setting, then there is an oracle-efficient, no-regret algorithm for contextual bandits in the case that contexts arrive in a smooth manner. △ Less

Submitted 31 May, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

arXiv:2201.04735 [pdf, ps, other]

Planning in Observable POMDPs in Quasipolynomial Time

Authors: Noah Golowich, Ankur Moitra, Dhruv Rohatgi

Abstract: Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost… ▽ More Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis. △ Less

Submitted 23 March, 2022; v1 submitted 12 January, 2022; originally announced January 2022.

Comments: 52 pages

arXiv:2111.12786 [pdf, ps, other]

Differentially Private Nonparametric Regression Under a Growth Condition

Authors: Noah Golowich

Abstract: Given a real-valued hypothesis class $\mathcal{H}$, we investigate under what conditions there is a differentially private algorithm which learns an optimal hypothesis from $\mathcal{H}$ given i.i.d. data. Inspired by recent results for the related setting of binary classification (Alon et al., 2019; Bun et al., 2020), where it was shown that online learnability of a binary class is necessary and… ▽ More Given a real-valued hypothesis class $\mathcal{H}$, we investigate under what conditions there is a differentially private algorithm which learns an optimal hypothesis from $\mathcal{H}$ given i.i.d. data. Inspired by recent results for the related setting of binary classification (Alon et al., 2019; Bun et al., 2020), where it was shown that online learnability of a binary class is necessary and sufficient for its private learnability, Jung et al. (2020) showed that in the setting of regression, online learnability of $\mathcal{H}$ is necessary for private learnability. Here online learnability of $\mathcal{H}$ is characterized by the finiteness of its $η$-sequential fat shattering dimension, ${\rm sfat}_η(\mathcal{H})$, for all $η> 0$. In terms of sufficient conditions for private learnability, Jung et al. (2020) showed that $\mathcal{H}$ is privately learnable if $\lim_{η\downarrow 0} {\rm sfat}_η(\mathcal{H})$ is finite, which is a fairly restrictive condition. We show that under the relaxed condition $\lim \inf_{η\downarrow 0} η\cdot {\rm sfat}_η(\mathcal{H}) = 0$, $\mathcal{H}$ is privately learnable, establishing the first nonparametric private learnability guarantee for classes $\mathcal{H}$ with ${\rm sfat}_η(\mathcal{H})$ diverging as $η\downarrow 0$. Our techniques involve a novel filtering procedure to output stable hypotheses for nonparametric function classes. △ Less

Submitted 24 November, 2021; originally announced November 2021.

Comments: 41 pages; appeared in COLT 2021

arXiv:2111.08911 [pdf, ps, other]

Fast Rates for Nonparametric Online Learning: From Realizability to Learning in Games

Authors: Constantinos Daskalakis, Noah Golowich

Abstract: We study fast rates of convergence in the setting of nonparametric online regression, namely where regret is defined with respect to an arbitrary function class which has bounded complexity. Our contributions are two-fold: - In the realizable setting of nonparametric online regression with the absolute loss, we propose a randomized proper learning algorithm which gets a near-optimal cumulative l… ▽ More We study fast rates of convergence in the setting of nonparametric online regression, namely where regret is defined with respect to an arbitrary function class which has bounded complexity. Our contributions are two-fold: - In the realizable setting of nonparametric online regression with the absolute loss, we propose a randomized proper learning algorithm which gets a near-optimal cumulative loss in terms of the sequential fat-shattering dimension of the hypothesis class. In the setting of online classification with a class of Littlestone dimension $d$, our bound reduces to $d \cdot {\rm poly} \log T$. This result answers a question as to whether proper learners could achieve near-optimal cumulative loss; previously, even for online classification, the best known cumulative loss was $\tilde O( \sqrt{dT})$. Further, for the real-valued (regression) setting, a cumulative loss bound with near-optimal scaling on sequential fat-shattering dimension was not even known for improper learners, prior to this work. - Using the above result, we exhibit an independent learning algorithm for general-sum binary games of Littlestone dimension $d$, for which each player achieves regret $\tilde O(d^{3/4} \cdot T^{1/4})$. This result generalizes analogous results of Syrgkanis et al. (2015) who showed that in finite games the optimal regret can be accelerated from $O(\sqrt{T})$ in the adversarial setting to $O(T^{1/4})$ in the game setting. To establish the above results, we introduce several new techniques, including: a hierarchical aggregation rule to achieve the optimal cumulative loss for real-valued classes, a multi-scale extension of the proper online realizable learner of Hanneke et al. (2021), an approach to show that the output of such nonparametric learning algorithms is stable, and a proof that the minimax theorem holds in all online learnable games. △ Less

Submitted 12 April, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

Comments: 61 pages

arXiv:2111.06008 [pdf, ps, other]

doi 10.1145/3519935.3520031

Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games

Authors: Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson, Noah Golowich, Tuomas Sandholm

Abstract: Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoup… ▽ More Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms. △ Less

Submitted 24 January, 2023; v1 submitted 10 November, 2021; originally announced November 2021.

Comments: Appeared at STOC 2022

arXiv:2110.13052 [pdf, ps, other]

Can Q-Learning be Improved with Advice?

Authors: Noah Golowich, Ankur Moitra

Abstract: Despite rapid progress in theoretical reinforcement learning (RL) over the last few years, most of the known guarantees are worst-case in nature, failing to take advantage of structure that may be known a priori about a given RL problem at hand. In this paper we address the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumve… ▽ More Despite rapid progress in theoretical reinforcement learning (RL) over the last few years, most of the known guarantees are worst-case in nature, failing to take advantage of structure that may be known a priori about a given RL problem at hand. In this paper we address the question of whether worst-case lower bounds for regret in online learning of Markov decision processes (MDPs) can be circumvented when information about the MDP, in the form of predictions about its optimal $Q$-value function, is given to the algorithm. We show that when the predictions about the optimal $Q$-value function satisfy a reasonably weak condition we call distillation, then we can improve regret bounds by replacing the set of state-action pairs with the set of state-action pairs on which the predictions are grossly inaccurate. This improvement holds for both uniform regret bounds and gap-based ones. Further, we are able to achieve this property with an algorithm that achieves sublinear regret when given arbitrary predictions (i.e., even those which are not a distillation). Our work extends a recent line of work on algorithms with predictions, which has typically focused on simple online problems such as caching and scheduling, to the more complex and general problem of reinforcement learning. △ Less

Submitted 25 October, 2021; originally announced October 2021.

arXiv:2108.06924 [pdf, other]

Near-Optimal No-Regret Learning in General Games

Authors: Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich

Abstract: We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that… ▽ More We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Pen, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$. △ Less

Submitted 24 January, 2023; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: 40 pages

arXiv:2106.13513 [pdf, ps, other]

Littlestone Classes are Privately Online Learnable

Authors: Noah Golowich, Roi Livni

Abstract: We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \leq t \leq T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class… ▽ More We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \leq t \leq T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class $\mathcal{H}$. We require that the algorithm satisfies the following privacy constraint: the sequence $h_1, \ldots, h_T$ of hypotheses output by the algorithm needs to be an $(ε, δ)$-differentially private function of the whole input sequence $(x_1, y_1), \ldots, (x_T, y_T)$. We provide the first non-trivial regret bound for the realizable setting. Specifically, we show that if the class $\mathcal{H}$ has constant Littlestone dimension then, given an oblivious sequence of labelled examples, there is a private learner that makes in expectation at most $O(\log T)$ mistakes -- comparable to the optimal mistake bound in the non-private case, up to a logarithmic factor. Moreover, for general values of the Littlestone dimension $d$, the same mistake bound holds but with a doubly-exponential in $d$ factor. A recent line of work has demonstrated a strong connection between classes that are online learnable and those that are differentially-private learnable. Our results strengthen this connection and show that an online learning algorithm can in fact be directly privatized (in the realizable setting). We also discuss an adaptive setting and provide a sublinear regret bound of $O(\sqrt{T})$. △ Less

Submitted 25 June, 2021; originally announced June 2021.

arXiv:2102.06062 [pdf, other]

Deep Learning with Label Differential Privacy

Authors: Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Chiyuan Zhang

Abstract: The Randomized Response (RR) algorithm is a classical technique to improve robustness in survey aggregation, and has been widely adopted in applications with differential privacy guarantees. We propose a novel algorithm, Randomized Response with Prior (RRWithPrior), which can provide more accurate results while maintaining the same level of privacy guaranteed by RR. We then apply RRWithPrior to le… ▽ More The Randomized Response (RR) algorithm is a classical technique to improve robustness in survey aggregation, and has been widely adopted in applications with differential privacy guarantees. We propose a novel algorithm, Randomized Response with Prior (RRWithPrior), which can provide more accurate results while maintaining the same level of privacy guaranteed by RR. We then apply RRWithPrior to learn neural networks with label differential privacy (LabelDP), and show that when only the label needs to be protected, the model performance can be significantly improved over the previous state-of-the-art private baselines. Moreover, we study different ways to obtain priors, which when used with RRWithPrior can additionally improve the model performance, further reducing the accuracy gap between private and non-private models. We complement the empirical results with theoretical analysis showing that LabelDP is provably easier than protecting both the inputs and labels. △ Less

Submitted 26 October, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

Comments: NeurIPS 2021; 29 pages, 6 figures

arXiv:2101.04233 [pdf, other]

Independent Policy Gradient Methods for Competitive Reinforcement Learning

Authors: Constantinos Daskalakis, Dylan J. Foster, Noah Golowich

Abstract: We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy g… ▽ More We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation. △ Less

Submitted 11 January, 2021; originally announced January 2021.

Comments: Appeared at NeurIPS 2020

arXiv:2012.03893 [pdf, other]

Sample-efficient proper PAC learning with approximate differential privacy

Authors: Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi

Abstract: In this paper we prove that the sample complexity of properly learning a class of Littlestone dimension $d$ with approximate differential privacy is $\tilde O(d^6)$, ignoring privacy and accuracy parameters. This result answers a question of Bun et al. (FOCS 2020) by improving upon their upper bound of $2^{O(d)}$ on the sample complexity. Prior to our work, finiteness of the sample complexity for… ▽ More In this paper we prove that the sample complexity of properly learning a class of Littlestone dimension $d$ with approximate differential privacy is $\tilde O(d^6)$, ignoring privacy and accuracy parameters. This result answers a question of Bun et al. (FOCS 2020) by improving upon their upper bound of $2^{O(d)}$ on the sample complexity. Prior to our work, finiteness of the sample complexity for privately learning a class of finite Littlestone dimension was only known for improper private learners, and the fact that our learner is proper answers another question of Bun et al., which was also asked by Bousquet et al. (NeurIPS 2020). Using machinery developed by Bousquet et al., we then show that the sample complexity of sanitizing a binary hypothesis class is at most polynomial in its Littlestone dimension and dual Littlestone dimension. This implies that a class is sanitizable if and only if it has finite Littlestone dimension. An important ingredient of our proofs is a new property of binary hypothesis classes that we call irreducibility, which may be of independent interest. △ Less

Submitted 7 December, 2020; originally announced December 2020.

Comments: 40 pages

arXiv:2010.13724 [pdf, ps, other]

Tight last-iterate convergence rates for no-regret learning in multi-player games

Authors: Noah Golowich, Sarath Pattathil, Constantinos Daskalakis

Abstract: We study the question of obtaining last-iterate convergence rates for no-regret learning algorithms in multi-player games. We show that the optimistic gradient (OG) algorithm with a constant step-size, which is no-regret, achieves a last-iterate rate of $O(1/\sqrt{T})$ with respect to the gap function in smooth monotone games. This result addresses a question of Mertikopoulos & Zhou (2018), who as… ▽ More We study the question of obtaining last-iterate convergence rates for no-regret learning algorithms in multi-player games. We show that the optimistic gradient (OG) algorithm with a constant step-size, which is no-regret, achieves a last-iterate rate of $O(1/\sqrt{T})$ with respect to the gap function in smooth monotone games. This result addresses a question of Mertikopoulos & Zhou (2018), who asked whether extra-gradient approaches (such as OG) can be applied to achieve improved guarantees in the multi-agent learning setting. The proof of our upper bound uses a new technique centered around an adaptive choice of potential function at each iteration. We also show that the $O(1/\sqrt{T})$ rate is tight for all $p$-SCLI algorithms, which includes OG as a special case. As a byproduct of our lower bound analysis we additionally present a proof of a conjecture of Arjevani et al. (2015) which is more direct than previous approaches. △ Less

Submitted 26 October, 2020; originally announced October 2020.

Comments: To appear at NeurIPS 2020. 41 pages

arXiv:2007.03668 [pdf, ps, other]

Near-tight closure bounds for Littlestone and threshold dimensions

Authors: Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi

Abstract: We study closure properties for the Littlestone and threshold dimensions of binary hypothesis classes. Given classes $\mathcal{H}_1, \ldots, \mathcal{H}_k$ of Boolean functions with bounded Littlestone (respectively, threshold) dimension, we establish an upper bound on the Littlestone (respectively, threshold) dimension of the class defined by applying an arbitrary binary aggregation rule to… ▽ More We study closure properties for the Littlestone and threshold dimensions of binary hypothesis classes. Given classes $\mathcal{H}_1, \ldots, \mathcal{H}_k$ of Boolean functions with bounded Littlestone (respectively, threshold) dimension, we establish an upper bound on the Littlestone (respectively, threshold) dimension of the class defined by applying an arbitrary binary aggregation rule to $\mathcal{H}_1, \ldots, \mathcal{H}_k$. We also show that our upper bounds are nearly tight. Our upper bounds give an exponential (in $k$) improvement upon analogous bounds shown by Alon et al. (COLT 2020), thus answering a question posed by their work. △ Less

Submitted 7 July, 2020; originally announced July 2020.

Comments: 7 pages

arXiv:2002.01919 [pdf, other]

Pure Differentially Private Summation from Anonymous Messages

Authors: Badih Ghazi, Noah Golowich, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Ameya Velingker

Abstract: The shuffled (aka anonymous) model has recently generated significant interest as a candidate distributed privacy framework with trust assumptions better than the central model but with achievable errors smaller than the local model. We study pure differentially private (DP) protocols in the shuffled model for summation, a basic and widely used primitive: - For binary summation where each of n u… ▽ More The shuffled (aka anonymous) model has recently generated significant interest as a candidate distributed privacy framework with trust assumptions better than the central model but with achievable errors smaller than the local model. We study pure differentially private (DP) protocols in the shuffled model for summation, a basic and widely used primitive: - For binary summation where each of n users holds a bit as an input, we give a pure $ε$-DP protocol for estimating the number of ones held by the users up to an error of $O_ε(1)$, and each user sends $O_ε(\log n)$ messages each of 1 bit. This is the first pure protocol in the shuffled model with error $o(\sqrt{n})$ for constant $ε$. Using this protocol, we give a pure $ε$-DP protocol that performs summation of real numbers in $[0, 1]$ up to an error of $O_ε(1)$, and where each user sends $O_ε(\log^3 n)$ messages each of $O(\log\log n)$ bits. - In contrast, we show that for any pure $ε$-DP protocol for binary summation in the shuffled model having absolute error $n^{0.5-Ω(1)}$, the per user communication has to be at least $Ω_ε(\sqrt{\log n})$ bits. This implies the first separation between the (bounded-communication) multi-message shuffled model and the central model, and the first separation between pure and approximate DP protocols in the shuffled model. To prove our lower bound, we consider (a generalization of) the following question: given $γ$ in $(0, 1)$, what is the smallest m for which there are two random variables $X^0, X^1$ supported on $\{0, \dots ,m\}$ such that (i) the total variation distance between $X^0$ and $X^1$ is at least $1-γ$, and (ii) the moment generating functions of $X^0$ and $X^1$ are within a constant factor of each other everywhere? We show that the answer is $m = Θ(\sqrt{\log(1/γ)})$. △ Less

Submitted 5 February, 2020; originally announced February 2020.

Comments: 40 pages, 3 figures

arXiv:2002.00057 [pdf, ps, other]

Last Iterate is Slower than Averaged Iterate in Smooth Convex-Concave Saddle Point Problems

Authors: Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, Asuman Ozdaglar

Abstract: In this paper we study the smooth convex-concave saddle point problem. Specifically, we analyze the last iterate convergence properties of the Extragradient (EG) algorithm. It is well known that the ergodic (averaged) iterates of EG converge at a rate of $O(1/T)$ (Nemirovski, 2004). In this paper, we show that the last iterate of EG converges at a rate of $O(1/\sqrt{T})$. To the best of our knowle… ▽ More In this paper we study the smooth convex-concave saddle point problem. Specifically, we analyze the last iterate convergence properties of the Extragradient (EG) algorithm. It is well known that the ergodic (averaged) iterates of EG converge at a rate of $O(1/T)$ (Nemirovski, 2004). In this paper, we show that the last iterate of EG converges at a rate of $O(1/\sqrt{T})$. To the best of our knowledge, this is the first paper to provide a convergence rate guarantee for the last iterate of EG for the smooth convex-concave saddle point problem. Moreover, we show that this rate is tight by proving a lower bound of $Ω(1/\sqrt{T})$ for the last iterate. This lower bound therefore shows a quadratic separation of the convergence rates of ergodic and last iterates in smooth convex-concave saddle point problems. △ Less

Submitted 6 July, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

Comments: 27 pages

arXiv:1909.00323 [pdf, other]

Round Complexity of Common Randomness Generation: The Amortized Setting

Authors: Noah Golowich, Madhu Sudan

Abstract: We study the effect of rounds of interaction on the common randomness generation (CRG) problem. In the CRG problem, two parties, Alice and Bob, receive samples $X_i$ and $Y_i$, respectively, drawn jointly from a source distribution $μ$. The two parties wish to agree on a common random key consisting of many bits of randomness, by exchanging messages that depend on each party's input and the previo… ▽ More We study the effect of rounds of interaction on the common randomness generation (CRG) problem. In the CRG problem, two parties, Alice and Bob, receive samples $X_i$ and $Y_i$, respectively, drawn jointly from a source distribution $μ$. The two parties wish to agree on a common random key consisting of many bits of randomness, by exchanging messages that depend on each party's input and the previous messages. In this work we study the amortized version of the problem, i.e., the number of bits of communication needed per random bit output by Alice and Bob, in the limit as the number of bits generated tends to infinity. The amortized version of the CRG problem has been extensively studied, though very little was known about the effect of interaction on this problem. Recently Bafna et al. (SODA 2019) considered the non-amortized version of the problem: they gave a family of sources $μ_{r,n}$ parameterized by $r,n\in\mathbb{N}$, such that with $r+2$ rounds of communication one can generate $n$ bits of common randomness with this source with $O(r\log n)$ communication, whereas with roughly $r/2$ rounds the communication complexity is $Ω(n/{\rm poly}\log n)$. Note that their source is designed with the target number of bits in mind and hence the result does not apply to the amortized setting. In this work we strengthen the work of Bafna et al. in two ways: First we show that the results extend to the amortized setting. We also reduce the gap between the round complexity in the upper and lower bounds to an additive constant. Specifically we show that for every pair $r,n \in \mathbb{N}$ the (amortized) communication complexity to generate $Ω(n)$ bits of common randomness from the source $μ_{r,n}$ using $r+2$ rounds of communication is $O(r\log n)$ whereas the amortized communication required to generate the same amount of randomness from $r$ rounds is $Ω(\sqrt n)$. △ Less

Submitted 1 September, 2019; originally announced September 2019.

Comments: 41 pages

arXiv:1908.11358 [pdf, other]

On the Power of Multiple Anonymous Messages

Authors: Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, Ameya Velingker

Abstract: An exciting new development in differential privacy is the shuffled model, in which an anonymous channel enables non-interactive, differentially private protocols with error much smaller than what is possible in the local model, while relying on weaker trust assumptions than in the central model. In this paper, we study basic counting problems in the shuffled model and establish separations betwee… ▽ More An exciting new development in differential privacy is the shuffled model, in which an anonymous channel enables non-interactive, differentially private protocols with error much smaller than what is possible in the local model, while relying on weaker trust assumptions than in the central model. In this paper, we study basic counting problems in the shuffled model and establish separations between the error that can be achieved in the single-message shuffled model and in the shuffled model with multiple messages per user. For the problem of frequency estimation for $n$ users and a domain of size $B$, we obtain: - A nearly tight lower bound of $\tildeΩ( \min(\sqrt[4]{n}, \sqrt{B}))$ on the error in the single-message shuffled model. This implies that the protocols obtained from the amplification via shuffling work of Erlingsson et al. (SODA 2019) and Balle et al. (Crypto 2019) are essentially optimal for single-message protocols. A key ingredient in the proof is a lower bound on the error of locally-private frequency estimation in the low-privacy (aka high $ε$) regime. - Protocols in the multi-message shuffled model with $poly(\log{B}, \log{n})$ bits of communication per user and $poly\log{B}$ error, which provide an exponential improvement on the error compared to what is possible with single-message algorithms. For the related selection problem on a domain of size $B$, we prove: - A nearly tight lower bound of $Ω(B)$ on the number of users in the single-message shuffled model. This significantly improves on the $Ω(B^{1/17})$ lower bound obtained by Cheu et al. (Eurocrypt 2019), and when combined with their $\tilde{O}(\sqrt{B})$-error multi-message protocol, implies the first separation between single-message and multi-message protocols for this problem. △ Less

Submitted 19 May, 2020; v1 submitted 29 August, 2019; originally announced August 2019.

Comments: 70 pages, 2 figures, 3 tables

arXiv:1810.02281 [pdf, other]

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Authors: Sanjeev Arora, Nadav Cohen, Noah Golowich, Wei Hu

Abstract: We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices a… ▽ More We analyze speed of convergence to global optimum for gradient descent training a deep linear neural network (parameterized as $x \mapsto W_N W_{N-1} \cdots W_1 x$) by minimizing the $\ell_2$ loss over whitened data. Convergence at a linear rate is guaranteed when the following hold: (i) dimensions of hidden layers are at least the minimum of the input and output dimensions; (ii) weight matrices at initialization are approximately balanced; and (iii) the initial loss is smaller than the loss of any rank-deficient solution. The assumptions on initialization (conditions (ii) and (iii)) are necessary, in the sense that violating any one of them may lead to convergence failure. Moreover, in the important case of output dimension 1, i.e. scalar regression, they are met, and thus convergence to global optimum holds, with constant probability under a random initialization scheme. Our results significantly extend previous analyses, e.g., of deep linear residual networks (Bartlett et al., 2018). △ Less

Submitted 26 October, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: Published as a conference paper at ICLR 2019

arXiv:1808.08907 [pdf, other]

Communication-Rounds Tradeoffs for Common Randomness and Secret Key Generation

Authors: Mitali Bafna, Badih Ghazi, Noah Golowich, Madhu Sudan

Abstract: We study the role of interaction in the Common Randomness Generation (CRG) and Secret Key Generation (SKG) problems. In the CRG problem, two players, Alice and Bob, respectively get samples $X_1,X_2,\dots$ and $Y_1,Y_2,\dots$ with the pairs $(X_1,Y_1)$, $(X_2, Y_2)$, $\dots$ being drawn independently from some known probability distribution $μ$. They wish to communicate so as to agree on $L$ bits… ▽ More We study the role of interaction in the Common Randomness Generation (CRG) and Secret Key Generation (SKG) problems. In the CRG problem, two players, Alice and Bob, respectively get samples $X_1,X_2,\dots$ and $Y_1,Y_2,\dots$ with the pairs $(X_1,Y_1)$, $(X_2, Y_2)$, $\dots$ being drawn independently from some known probability distribution $μ$. They wish to communicate so as to agree on $L$ bits of randomness. The SKG problem is the restriction of the CRG problem to the case where the key is required to be close to random even to an eavesdropper who can listen to their communication (but does not have access to the inputs of Alice and Bob). In this work, we study the relationship between the amount of communication and the number of rounds of interaction in both the CRG and the SKG problems. Specifically, we construct a family of distributions $μ= μ_{r, n,L}$, parametrized by integers $r$, $n$ and $L$, such that for every $r$ there exists a constant $b = b(r)$ for which CRG (respectively SKG) is feasible when $(X_i,Y_i) \sim μ_{r,n,L}$ with $r+1$ rounds of communication, each consisting of $O(\log n)$ bits, but when restricted to $r/2 - 3$ rounds of interaction, the total communication must exceed $Ω(n/\log^{b}(n))$ bits. Prior to our work no separations were known for $r \geq 2$. △ Less

Submitted 27 August, 2018; originally announced August 2018.

Comments: 41 pages, 3 figures

arXiv:1801.02254 [pdf, other]

Theory of Deep Learning IIb: Optimization Properties of SGD

Authors: Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, Tomaso Poggio

Abstract: In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volume, "flat" minima, selecting flat minimizers which… ▽ More In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers △ Less

Submitted 7 January, 2018; originally announced January 2018.

arXiv:1712.06541 [pdf, ps, other]

Size-Independent Sample Complexity of Neural Networks

Authors: Noah Golowich, Alexander Rakhlin, Ohad Shamir

Abstract: We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are… ▽ More We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest. △ Less

Submitted 17 November, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: Fixed a bug in the proof of theorem 7 (not affecting theorem statement), by slightly changing the construction

Showing 1–41 of 41 results for author: Golowich, N