Showing 1–19 of 19 results for author: Azar, M G

Searching in archive stat.
  1. arXiv:2406.01660  [pdf, other]

    cs.LG cs.AI stat.ML

    Self-Improving Robust Preference Optimization

    Authors: Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar

    Abstract: Both online and offline RLHF methods such as PPO and DPO have been extremely successful in aligning AI with human preferences. Despite their success, the existing methods suffer from a fundamental problem that their optimal solution is highly task-dependent (i.e., not robust to out-of-distribution (OOD) tasks). Here we address this challenge by proposing Self-Improving Robust Preference Optimizati…

    Submitted 7 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  2. arXiv:2312.00886  [pdf, other]

    stat.ML cs.AI cs.GT cs.LG cs.MA

    Nash Learning from Human Feedback

    Authors: Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot

    Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to…

    Submitted 11 June, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

  3. arXiv:2310.12036  [pdf, other]

    cs.AI cs.LG stat.ML

    A General Theoretical Paradigm to Understand Learning from Human Preferences

    Authors: Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

    Abstract: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards. The second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direc…

    Submitted 21 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

  4. arXiv:2301.04462  [pdf, other]

    cs.LG stat.ML

    An Analysis of Quantile Temporal-Difference Learning

    Authors: Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney

    Abstract: We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic appro…

    Submitted 20 May, 2024; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: Accepted to JMLR

  5. arXiv:2206.08332  [pdf, other]

    cs.LG cs.AI stat.ML

    BYOL-Explore: Exploration by Bootstrapped Prediction

    Authors: Zhaohan Daniel Guo, Shantanu Thakoor, Miruna Pîslar, Bernardo Avila Pires, Florent Altché, Corentin Tallec, Alaa Saade, Daniele Calandriello, Jean-Bastien Grill, Yunhao Tang, Michal Valko, Rémi Munos, Mohammad Gheshlaghi Azar, Bilal Piot

    Abstract: We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challeng…

    Submitted 16 June, 2022; originally announced June 2022.

  6. arXiv:2205.14211  [pdf, other]

    cs.LG cs.AI stat.ML

    KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal

    Authors: Tadashi Kozuno, Wenhao Yang, Nino Vieillard, Toshinori Kitamura, Yunhao Tang, Jincheng Mei, Pierre Ménard, Mohammad Gheshlaghi Azar, Michal Valko, Rémi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvári

    Abstract: In this work, we consider and analyze the sample complexity of model-free reinforcement learning with a generative model. Particularly, we analyze mirror descent value iteration (MDVI) by Geist et al. (2019) and Vieillard et al. (2020a), which uses the Kullback-Leibler divergence and entropy regularization in its value and policy updates. Our analysis shows that it is nearly minimax-optimal for fi…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 29 pages, 6 figures

  7. arXiv:2102.06514  [pdf, other]

    cs.LG cs.SI stat.ML

    Large-Scale Representation Learning on Graphs via Bootstrapping

    Authors: Shantanu Thakoor, Corentin Tallec, Mohammad Gheshlaghi Azar, Mehdi Azabou, Eva L. Dyer, Rémi Munos, Petar Veličković, Michal Valko

    Abstract: Self-supervised learning provides a promising path towards eliminating the need for costly label information in representation learning on graphs. However, to achieve state-of-the-art performance, methods often need large numbers of negative examples and rely on complex augmentations. This can be prohibitively expensive, especially for large graphs. To address these challenges, we introduce Bootst…

    Submitted 20 February, 2023; v1 submitted 12 February, 2021; originally announced February 2021.

    Comments: Published as a conference paper at ICLR 2022

  8. arXiv:2006.07733  [pdf, other]

    cs.LG cs.CV stat.ML

    Bootstrap your own latent: A new approach to self-supervised Learning

    Authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko

    Abstract: We introduce Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the…

    Submitted 10 September, 2020; v1 submitted 13 June, 2020; originally announced June 2020.

  9. arXiv:1902.07685  [pdf, other]

    cs.AI stat.AP stat.ML

    World Discovery Models

    Authors: Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo Avila Pires, Jean-Bastien Grill, Florent Altché, Rémi Munos

    Abstract: As humans we are driven by a strong desire for seeking novelty in our world. Also upon observing a novel pattern we are capable of refining our understanding of the world based on the new information---humans can discover their world. The outstanding ability of the human mind for discovery has led to many breakthroughs in science, art and technology. Here we investigate the possibility of building…

    Submitted 1 March, 2019; v1 submitted 20 February, 2019; originally announced February 2019.

  10. arXiv:1811.06407  [pdf, other]

    cs.LG stat.ML

    Neural Predictive Belief Representations

    Authors: Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Bernardo A. Pires, Rémi Munos

    Abstract: Unsupervised representation learning has succeeded with excellent results in many applications. It is an especially powerful tool to learn a good representation of environments with partial or noisy observations. In partially observable domains it is important for the representation to encode a belief state, a sufficient statistic of the observations seen so far. In this paper, we investigate whet…

    Submitted 19 August, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

  11. arXiv:1805.11593  [pdf, other]

    cs.LG cs.AI stat.ML

    Observe and Look Further: Achieving Consistent Performance on Atari

    Authors: Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, Olivier Pietquin

    Abstract: Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and explori…

    Submitted 29 May, 2018; originally announced May 2018.

  12. arXiv:1706.10295  [pdf, other]

    cs.LG stat.ML

    Noisy Networks for Exploration

    Authors: Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

    Abstract: We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find…

    Submitted 9 July, 2019; v1 submitted 30 June, 2017; originally announced June 2017.

    Comments: ICLR 2018

  13. arXiv:1703.05449  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    Minimax Regret Bounds for Reinforcement Learning

    Authors: Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos

    Abstract: We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous…

    Submitted 1 July, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

  14. arXiv:1602.02191  [pdf, other]

    stat.ML cs.LG

    Convex Relaxation Regression: Black-Box Optimization of Smooth Functions by Learning Their Convex Envelopes

    Authors: Mohammad Gheshlaghi Azar, Eva Dyer, Konrad Kording

    Abstract: Finding efficient and provable methods to solve non-convex optimization problems is an outstanding challenge in machine learning and optimization theory. A popular approach used to tackle non-convex problems is to use convex relaxation techniques to find a convex surrogate for the problem. Unfortunately, convex relaxations typically must be found on a problem-by-problem basis. Thus, providing a ge…

    Submitted 3 March, 2016; v1 submitted 5 February, 2016; originally announced February 2016.

    Journal ref: Proc. of the Conference on Uncertainty in Artificial Intelligence, pg. 22-31, 2016

  15. arXiv:1402.0562  [pdf, ps, other]

    stat.ML cs.LG eess.SY

    Online Stochastic Optimization under Correlated Bandit Feedback

    Authors: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

    Abstract: In this paper we consider the problem of online stochastic optimization of a locally smooth function under bandit feedback. We introduce the high-confidence tree (HCT) algorithm, a novel any-time $\mathcal{X}$-armed bandit algorithm, and derive regret bounds matching the performance of existing state-of-the-art in terms of dependency on number of steps and smoothness factor. The main advantage of…

    Submitted 19 May, 2014; v1 submitted 3 February, 2014; originally announced February 2014.

  16. arXiv:1307.6887  [pdf, ps, other]

    stat.ML cs.LG

    Sequential Transfer in Multi-armed Bandit with Finite Set of Models

    Authors: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

    Abstract: Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of \textit{sequential tra…

    Submitted 25 July, 2013; originally announced July 2013.

  17. arXiv:1305.1027  [pdf, ps, other]

    stat.ML cs.LG

    Regret Bounds for Reinforcement Learning with Policy Advice

    Authors: Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

    Abstract: In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that RLPA has a sub-linear regret of \ti…

    Submitted 17 July, 2013; v1 submitted 5 May, 2013; originally announced May 2013.

  18. arXiv:1206.6461  [pdf]

    cs.LG stat.ML

    On the Sample Complexity of Reinforcement Learning with a Generative Model

    Authors: Mohammad Gheshlaghi Azar, Remi Munos, Bert Kappen

    Abstract: We consider the problem of learning the optimal action-value function in the discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with $N$ state-action pairs and the discount factor $\gamma\in[0,1)$ only $O(N\log(N/\delta)/((1-\gamma)^3\varepsilon^2))$ samples are req…

    Submitted 27 June, 2012; originally announced June 2012.

    Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

  19. arXiv:1004.2027  [pdf, ps, other]

    cs.LG cs.AI eess.SY math.OC stat.ML

    Dynamic Policy Programming

    Authors: Mohammad Gheshlaghi Azar, Vicenc Gomez, Hilbert J. Kappen

    Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. We prove the finite-iteration and asymptotic $\ell_\infty$-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the $\ell_\infty$-norm of the average accumula…

    Submitted 6 September, 2011; v1 submitted 12 April, 2010; originally announced April 2010.

    Comments: Submitted to Journal of Machine Learning Research