Showing 1–50 of 117 results for author: Munos, R

  1. arXiv:2405.19107  [pdf, ps, other]

    cs.LG cs.AI

    Offline Regularised Reinforcement Learning for Large Language Models Alignment

    Authors: Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

    Abstract: The dominant framework for alignment of large language models (LLM), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses…

    Submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2405.14655  [pdf, other]

    cs.LG

    Multi-turn Reinforcement Learning from Preference Human Feedback

    Authors: Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

    Abstract: Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models (LLMs) with human preferences, allowing LLMs to demonstrate remarkable abilities in various tasks. Existing methods work by emulating the preferences at the single decision (turn) level, limiting their capabilities in settings that require planning or multi-turn interactions to ach…

    Submitted 23 May, 2024; originally announced May 2024.

  3. arXiv:2405.08448  [pdf, other]

    cs.LG cs.AI

    Understanding the performance gap between online and offline alignment algorithms

    Authors: Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, Will Dabney

    Abstract: Reinforcement learning from human feedback (RLHF) is the canonical framework for large language model alignment. However, the rising popularity of offline alignment algorithms challenges the need for on-policy sampling in RLHF. Within the context of reward over-optimization, we start with an opening set of experiments that demonstrate the clear advantage of online methods over offline methods. This pro…

    Submitted 14 May, 2024; originally announced May 2024.

  4. arXiv:2405.04407  [pdf, ps, other]

    cs.LG cs.AI

    Super-Exponential Regret for UCT, AlphaGo and Variants

    Authors: Laurent Orseau, Remi Munos

    Abstract: We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $Ω(D)$ exp terms) on the $D$-chain environment, and that a `polynomial' UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment -- the original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present…

    Submitted 17 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

  5. arXiv:2403.08635  [pdf, other]

    cs.LG cs.AI stat.ML

    Human Alignment of Large Language Models through Online Preference Optimisation

    Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

    Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contributio…

    Submitted 13 March, 2024; originally announced March 2024.

  6. arXiv:2402.07598  [pdf, other]

    cs.LG stat.ML

    Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

    Authors: Mark Rowland, Li Kevin Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, Will Dabney

    Abstract: We propose a new algorithm for model-based distributional reinforcement learning (RL), and prove that it is minimax-optimal for approximating return distributions with a generative model (up to logarithmic factors), resolving an open question of Zhang et al. (2023). Our analysis provides new theoretical results on categorical approaches to distributional RL, and also introduces a new distributiona…

    Submitted 12 February, 2024; originally announced February 2024.

  7. arXiv:2402.05766  [pdf, other]

    cs.LG

    Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling

    Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney

    Abstract: We introduce off-policy distributional Q($λ$), a new addition to the family of off-policy distributional evaluation algorithms. Off-policy distributional Q($λ$) does not apply importance sampling for off-policy learning, which introduces intriguing interactions with signed measures. Such unique properties distinguish distributional Q($λ$) from other existing alternatives such as distributional Retrace. We cha…

    Submitted 8 February, 2024; originally announced February 2024.

  8. arXiv:2402.05749  [pdf, other]

    cs.LG cs.AI

    Generalized Preference Optimization: A Unified Approach to Offline Alignment

    Authors: Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, Bilal Piot

    Abstract: Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC a…

    Submitted 28 May, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024 main conference

  9. arXiv:2312.00886  [pdf, other]

    stat.ML cs.AI cs.GT cs.LG cs.MA

    Nash Learning from Human Feedback

    Authors: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

    Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to…

    Submitted 11 June, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

  10. arXiv:2310.18186  [pdf, other]

    stat.ML cs.LG

    Model-free Posterior Sampling via Learning Rate Randomization

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, Pierre Menard

    Abstract: In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieve…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: NeurIPS-2023

  11. arXiv:2310.12036  [pdf, other]

    cs.AI cs.LG stat.ML

    A General Theoretical Paradigm to Understand Learning from Human Preferences

    Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

    Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direc…

    Submitted 21 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  12. arXiv:2309.00656  [pdf, other]

    cs.GT cs.LG stat.ML

    Local and adaptive mirror descents in extensive-form games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: We study how to learn $ε$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback. In this setting, players update their policies sequentially based on their observations over a fixed number of episodes, denoted by $T$. Existing procedures suffer from high variance due to the use of importance sampling over sequences of actions (Steinberger et al., 2020; McAleer e…

    Submitted 1 September, 2023; originally announced September 2023.

  13. arXiv:2305.18501  [pdf, other]

    cs.LG

    DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

    Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Rémi Munos, Bernardo Ávila Pires, Michal Valko

    Abstract: Multi-step learning applies lookahead over multiple time steps and has proved valuable in policy evaluation settings. However, in the optimal control case, the impact of multi-step learning has been relatively limited despite a number of prior efforts. Fundamentally, this might be because multi-step policy improvements require operations that cannot be approximated by stochastic samples, hence hin…

    Submitted 29 May, 2023; originally announced May 2023.

  14. arXiv:2305.18491  [pdf, other]

    cs.LG

    Towards a Better Understanding of Representation Dynamics under TD-learning

    Authors: Yunhao Tang, Rémi Munos

    Abstract: TD-learning is a foundational reinforcement learning (RL) algorithm for value prediction. Critical to the accuracy of value predictions is the quality of state representations. In this work, we consider the question: how does end-to-end TD-learning impact the representation over time? Complementary to prior work, we provide a set of analyses that shed further light on the representation dynamics un…

    Submitted 29 May, 2023; originally announced May 2023.

  15. arXiv:2305.18388  [pdf, other]

    cs.LG stat.ML

    The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

    Authors: Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney

    Abstract: We study the problem of temporal-difference-based policy evaluation in reinforcement learning. In particular, we analyse the use of a distributional reinforcement learning algorithm, quantile temporal-difference learning (QTD), for this task. We reach the surprising conclusion that even if a practitioner has no interest in the return distribution beyond the mean, QTD (which learns predictions abou…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: ICML 2023

  16. arXiv:2305.18161  [pdf, other]

    cs.LG

    VA-learning as a more efficient alternative to Q-learning

    Authors: Yunhao Tang, Rémi Munos, Mark Rowland, Michal Valko

    Abstract: In reinforcement learning, the advantage function is critical for policy improvement, but is often extracted from a learned Q-function. A natural question is: Why not learn the advantage function directly? In this work, we introduce VA-learning, which directly learns the advantage function and value function using bootstrapping, without explicit reference to Q-functions. VA-learning learns off-policy…

    Submitted 29 May, 2023; originally announced May 2023.

  17. arXiv:2305.13185  [pdf, other]

    cs.LG

    Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

    Authors: Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári, Wataru Kumagai, Yutaka Matsuo

    Abstract: Mirror descent value iteration (MDVI), an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL), has served as the basis for recent high-performing practical RL algorithms. However, despite the use of function approximation in practice, the theoretical understanding of MDVI has been limited to tabular Markov decision processes (MDPs). We study MDVI with linear fu…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: ICML 2023 accepted

  18. arXiv:2305.00654  [pdf, other]

    cs.LG cs.AI

    Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition

    Authors: Yash Chandak, Shantanu Thakoor, Zhaohan Daniel Guo, Yunhao Tang, Remi Munos, Will Dabney, Diana L Borsa

    Abstract: Representation learning and exploration are among the key challenges for any deep reinforcement learning agent. In this work, we provide a singular value decomposition based method that can be used to obtain representations that preserve the underlying transition structure in the domain. Perhaps interestingly, we show that these representations also capture the relative frequency of state visitati…

    Submitted 2 May, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

    Comments: Accepted at the 40th International Conference on Machine Learning (ICML 2023)

  19. arXiv:2303.08059  [pdf, other]

    stat.ML cs.LG

    Fast Rates for Maximum Entropy Exploration

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, Pierre Menard

    Abstract: We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards. In this work, we study the maximum entropy exploration problem of two different types. The first type is visitation entropy maximization previously considered by Hazan et al. (2019) in the discounted setting. For this type of exploration, we propose a g…

    Submitted 6 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: ICML-2023

  20. arXiv:2301.04462  [pdf, other]

    cs.LG stat.ML

    An Analysis of Quantile Temporal-Difference Learning

    Authors: Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney

    Abstract: We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic appro…

    Submitted 20 May, 2024; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: Accepted to JMLR

  21. arXiv:2212.12567  [pdf, other]

    stat.ML cs.LG

    Adapting to game trees in zero-sum imperfect information games

    Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko

    Abstract: Imperfect information games (IIG) are games in which each player only partially observes the current game state. We study how to learn $ε$-optimal strategies in a zero-sum IIG through self-play with trajectory feedback. We give a problem-independent lower bound $\widetilde{\mathcal{O}}(H(A_{\mathcal{X}}+B_{\mathcal{Y}})/ε^2)$ on the required number of realizations to learn these strategies with hi…

    Submitted 15 February, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

  22. arXiv:2212.03319  [pdf, other]

    cs.LG cs.AI

    Understanding Self-Predictive Learning for Reinforcement Learning

    Authors: Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Ávila Pires, Yash Chandak, Rémi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, András György, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko

    Abstract: We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirabl…

    Submitted 6 December, 2022; originally announced December 2022.

  23. arXiv:2211.10515  [pdf, other]

    stat.ML cs.LG

    Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments

    Authors: Daniel Jarrett, Corentin Tallec, Florent Altché, Thomas Mesnard, Rémi Munos, Michal Valko

    Abstract: Consider the problem of exploration in sparse-reward or reward-free environments, such as in Montezuma's Revenge. In the curiosity-driven paradigm, the agent is rewarded for how much each realized outcome differs from their predicted outcome. But using predictive error as intrinsic motivation is fragile in stochastic environments, as the agent may become trapped by high-entropy areas of the state-…

    Submitted 14 July, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

    Journal ref: In Proc. 40th International Conference on Machine Learning (ICML 2023)

  24. arXiv:2209.14414  [pdf, other]

    stat.ML cs.LG

    Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

    Authors: Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

    Abstract: We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of poste…

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07704

  25. arXiv:2207.07570  [pdf, other]

    cs.LG

    The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning

    Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney, Marc G. Bellemare

    Abstract: We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The…

    Submitted 15 July, 2022; originally announced July 2022.

  26. arXiv:2206.15378  [pdf, other]

    cs.AI cs.GT cs.MA

    Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning

    Authors: Julien Perolat, Bart de Vylder, Daniel Hennes, Eugene Tarassov, Florian Strub, Vincent de Boer, Paul Muller, Jerome T. Connor, Neil Burch, Thomas Anthony, Stephen McAleer, Romuald Elie, Sarah H. Cen, Zhe Wang, Audrunas Gruslys, Aleksandra Malysheva, Mina Khan, Sherjil Ozair, Finbarr Timbers, Toby Pohlen, Tom Eccles, Mark Rowland, Marc Lanctot, Jean-Baptiste Lespiau, Bilal Piot, et al. (9 additional authors not shown)

    Abstract: We introduce DeepNash, an autonomous agent capable of learning to play the imperfect information game Stratego from scratch, up to a human expert level. Stratego is one of the few iconic board games that Artificial Intelligence (AI) has not yet mastered. This popular game has an enormous game tree on the order of $10^{535}$ nodes, i.e., $10^{175}$ times larger than that of Go. It has the additiona…

    Submitted 30 June, 2022; originally announced June 2022.

  27. arXiv:2206.08736  [pdf, other]

    stat.ML cs.LG

    Generalised Policy Improvement with Geometric Policy Composition

    Authors: Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto

    Abstract: We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate…

    Submitted 17 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  28. arXiv:2206.08332  [pdf, other]

    cs.LG cs.AI stat.ML

    BYOL-Explore: Exploration by Bootstrapped Prediction

    Authors: Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, Bilal Piot

    Abstract: We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challeng…

    Submitted 16 June, 2022; originally announced June 2022.

  29. arXiv:2205.14211  [pdf, other]

    cs.LG cs.AI stat.ML

    KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

    Authors: Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári

    Abstract: In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for fi…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 29 pages, 6 figures

  30. arXiv:2203.16177  [pdf, other]

    cs.LG

    Marginalized Operators for Off-policy Reinforcement Learning

    Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko

    Abstract: In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases. Marginalized operators also suggest a form of sample-based estimates with potential variance reduction, compared to sample-based estimates of the original multi-step op…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at AISTATS 2022

  31. arXiv:2106.13125  [pdf, other]

    cs.LG

    Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation

    Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Rémi Munos, Michal Valko

    Abstract: Model-agnostic meta-reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framew…

    Submitted 3 November, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

    Comments: Accepted at Neural Information Processing Systems (NeurIPS), 2021. Code is available at https://github.com/robintyh1/neurips2021-meta-gradient-offpolicy-evaluation

  32. arXiv:2106.06279  [pdf, ps, other]

    stat.ML cs.LG

    Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

    Authors: Tadashi Kozuno, Pierre Ménard, Rémi Munos, Michal Valko

    Abstract: We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamic of the IIG is not known -- we can only access it by sampling or interacting with a g…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: 20 pages

  33. arXiv:2106.06170  [pdf, other]

    cs.LG

    Taylor Expansion of Discount Factors

    Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Michal Valko

    Abstract: In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective. In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors. Our analysis suggests new ways for…

    Submitted 14 June, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted at International Conference of Machine Learning (ICML), 2021

  34. arXiv:2106.03787  [pdf, other]

    cs.LG cs.MA

    Concave Utility Reinforcement Learning: the Mean-Field Game Viewpoint

    Authors: Matthieu Geist, Julien Pérolat, Mathieu Laurière, Romuald Elie, Sarah Perrin, Olivier Bachem, Rémi Munos, Olivier Pietquin

    Abstract: Concave Utility Reinforcement Learning (CURL) extends RL from linear to concave utilities in the occupancy measure induced by the agent's policy. This encompasses not only RL but also imitation learning and exploration, among others. Yet, this more general paradigm invalidates the classical Bellman equations, and calls for new algorithms. Mean-field Games (MFGs) are a continuous approximation of m…

    Submitted 16 February, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: AAMAS 2022

  35. arXiv:2103.00107  [pdf, other]

    cs.LG cs.AI stat.ML

    Revisiting Peng's Q($λ$) for Modern Reinforcement Learning

    Authors: Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, David Abel

    Abstract: Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonethel…

    Submitted 26 February, 2021; originally announced March 2021.

    Comments: 26 pages, 7 figures, 2 tables

  36. arXiv:2102.06514  [pdf, other]

    cs.LG cs.SI stat.ML

    Large-Scale Representation Learning on Graphs via Bootstrapping

    Authors: Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko

    Abstract: Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootst…

    Submitted 20 February, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

    Comments: Published as a conference paper at ICLR 2022

  37. arXiv:2101.02055  [pdf, other]

    cs.LG

    Geometric Entropic Exploration

    Authors: Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Alaa Saade, Shantanu Thakoor, Bilal Piot, Bernardo Avila Pires, Michal Valko, Thomas Mesnard, Tor Lattimore, Rémi Munos

    Abstract: Exploration is essential for solving complex Reinforcement Learning (RL) tasks. Maximum State-Visitation Entropy (MSVE) formulates the exploration problem as a well-defined policy optimization problem whose solution aims at visiting all states as uniformly as possible. This is in contrast to standard uncertainty-based approaches where exploration is transient and eventually vanishes. However, exis…

    Submitted 7 January, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

  38. arXiv:2011.09464  [pdf, other]

    cs.LG

    Counterfactual Credit Assignment in Model-Free Reinforcement Learning

    Authors: Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, Éric Moulines, Marcus Hutter, Lars Buesing, Rémi Munos

    Abstract: Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to…

    Submitted 14 December, 2021; v1 submitted 18 November, 2020; originally announced November 2020.

  39. arXiv:2011.09192  [pdf, other]

    cs.AI cs.GT cs.MA

    Game Plan: What AI can do for Football, and What Football can do for AI

    Authors: Karl Tuyls, Shayegan Omidshafiei, Paul Muller, Zhe Wang, Jerome Connor, Daniel Hennes, Ian Graham, William Spearman, Tim Waskett, Dafydd Steele, Pauline Luc, Adria Recasens, Alexandre Galashov, Gregory Thornton, Romuald Elie, Pablo Sprechmann, Pol Moreno, Kris Cao, Marta Garnelo, Praneet Dutta, Michal Valko, Nicolas Heess, Alex Bridgland, Julien Perolat, Bart De Vylder, et al. (11 additional authors not shown)

    Abstract: The rapid progress in artificial intelligence (AI) and machine learning has opened unprecedented analytics possibilities in various team and individual sports, including baseball, basketball, and tennis. More recently, AI techniques have been applied to football, due to a huge increase in data collection by professional teams, increased computational power, and advances in machine learning, with t…

    Submitted 18 November, 2020; originally announced November 2020.

  40. arXiv:2008.12234  [pdf, other]

    cs.AI cs.LG

    The Advantage Regret-Matching Actor-Critic

    Authors: Audrūnas Gruslys, Marc Lanctot, Rémi Munos, Finbarr Timbers, Martin Schmid, Julien Perolat, Dustin Morrill, Vinicius Zambaldi, Jean-Baptiste Lespiau, John Schultz, Mohammad Gheshlaghi Azar, Michael Bowling, Karl Tuyls

    Abstract: Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC…

    Submitted 27 August, 2020; originally announced August 2020.

  41. arXiv:2007.12509  [pdf, other]

    cs.LG stat.ML

    Monte-Carlo Tree Search as Regularized Policy Optimization

    Authors: Jean-Bastien Grill, Florent Altché, Yunhao Tang, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos

    Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approxima…

    Submitted 24 July, 2020; originally announced July 2020.

    Comments: Accepted to International Conference on Machine Learning (ICML), 2020

  42. arXiv:2006.07733  [pdf, other]

    cs.LG cs.CV stat.ML

    Bootstrap your own latent: A new approach to self-supervised Learning

    Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

    Abstract: We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the…

    Submitted 10 September, 2020; v1 submitted 13 June, 2020; originally announced June 2020.

  43. Navigating the Landscape of Multiplayer Games

    Authors: Shayegan Omidshafiei, Karl Tuyls, Wojciech M. Czarnecki, Francisco C. Santos, Mark Rowland, Jerome Connor, Daniel Hennes, Paul Muller, Julien Perolat, Bart De Vylder, Audrunas Gruslys, Remi Munos

    Abstract: Multiplayer games have long been used as testbeds in artificial intelligence research, aptly referred to as the Drosophila of artificial intelligence. Traditionally, researchers have focused on using well-known games to build strong agents. This progress, however, can be better informed by characterizing games and their topological landscape. Tackling this latter question can facilitate understand…

    Submitted 17 November, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

  44. arXiv:2004.14646  [pdf, other]

    cs.LG cs.AI

    Bootstrap Latent-Predictive Representations for Multitask Reinforcement Learning

    Authors: Daniel Guo, Bernardo Avila Pires, Bilal Piot, Jean-bastien Grill, Florent Altché, Rémi Munos, Mohammad Gheshlaghi Azar

    Abstract: Learning a good representation is an essential component for deep reinforcement learning (RL). Representation learning is especially important in multitask and partially observable settings where building a representation of the unknown environment is crucial to solve the tasks. Here we introduce Prediction of Bootstrap Latents (PBL), a simple and flexible self-supervised representation learning a…

    Submitted 30 April, 2020; originally announced April 2020.

  45. arXiv:2003.14089  [pdf, other]

    cs.LG stat.ML

    Leverage the Average: an Analysis of KL Regularization in RL

    Authors: Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist

    Abstract: Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, only little is understood theoretically about why KL regularization helps, so far. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a ve…

    Submitted 6 January, 2021; v1 submitted 31 March, 2020; originally announced March 2020.

    Comments: NeurIPS 2020

  46. arXiv:2003.06259  [pdf, other]

    cs.LG stat.ML

    Taylor Expansion Policy Optimization

    Authors: Yunhao Tang, Michal Valko, Rémi Munos

    Abstract: In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifica…

    Submitted 13 March, 2020; originally announced March 2020.

  47. arXiv:2002.08456  [pdf, other]

    cs.GT cs.LG stat.ML

    From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization

    Authors: Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, Georgios Piliouras, Marc Lanctot, Karl Tuyls

    Abstract: In this paper we investigate the Follow the Regularized Leader dynamics in sequential imperfect information games (IIG). We generalize existing results of Poincaré recurrence from normal-form games to zero-sum two-player imperfect information games and other sequential game settings. We then investigate how adapting the reward (by adding a regularization term) of the game can give strong convergen…

    Submitted 19 February, 2020; originally announced February 2020.

    Comments: 43 pages

  48. arXiv:1912.02503  [pdf, other]

    cs.LG stat.ML

    Hindsight Credit Assignment

    Authors: Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, Remi Munos

    Abstract: We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions…

    Submitted 5 December, 2019; originally announced December 2019.

    Comments: NeurIPS 2019

  49. arXiv:1910.07479  [pdf, other]

    cs.LG stat.ML

    Conditional Importance Sampling for Off-Policy Learning

    Authors: Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney

    Abstract: The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms th…

    Submitted 30 July, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: AISTATS 2020 camera-ready version

  50. arXiv:1910.07478  [pdf, other]

    cs.LG stat.ML

    Adaptive Trade-Offs in Off-Policy Learning

    Authors: Mark Rowland, Will Dabney, Rémi Munos

    Abstract: A great variety of off-policy learning algorithms exist in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, an…

    Submitted 30 July, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: AISTATS 2020 camera-ready version