Showing 1–21 of 21 results for author: Gleave, A

  1. arXiv:2406.12843

    cs.LG cs.AI stat.ML

    Can Go AIs be adversarially robust?

    Authors: Tom Tseng, Euan McLean, Kellin Pelrine, Tony T. Wang, Adam Gleave

    Abstract: Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 67 pages

  2. arXiv:2402.11777

    cs.CL cs.AI cs.LG

    Uncovering Latent Human Wellbeing in Language Model Embeddings

    Authors: Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons

    Abstract: Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT…

    Submitted 18 February, 2024; originally announced February 2024.

    Comments: 10 pages, 5 figures, 1 table

    ACM Class: I.2.7

  3. arXiv:2312.14302

    cs.CR cs.AI cs.CL cs.LG

    Exploiting Novel GPT-4 APIs

    Authors: Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave

    Abstract: Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose "gray-box" access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 10 pages, 1 figure, 4 tables

    ACM Class: I.2.7

  4. arXiv:2309.15257

    cs.LG cs.AI

    STARC: A General Framework For Quantifying Differences Between Reward Functions

    Authors: Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

    Abstract: In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward fun…

    Submitted 11 March, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

  5. arXiv:2301.03652

    cs.LG cs.AI

    On The Fragility of Learned Reward Functions

    Authors: Lev McKinney, Yawen Duan, David Krueger, Adam Gleave

    Abstract: Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of tr…

    Submitted 9 January, 2023; originally announced January 2023.

    Comments: 5 pages, 2 figures, presented at the NeurIPS Deep RL and ML Safety Workshops

  6. arXiv:2211.11972

    cs.LG cs.AI

    imitation: Clean Imitation Learning Implementations

    Authors: Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell

    Abstract: imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch. We include three inverse reinforcement learning (IRL) algorithms, three imitation learning algorithms and a preference comparison algorithm. The implementations have been benchmarked against previous results, and automated tests cover 98% of the code. Moreover, the algorithms are implemented in a…

    Submitted 21 November, 2022; originally announced November 2022.

  7. arXiv:2211.00241

    cs.LG cs.AI cs.CR stat.ML

    Adversarial Policies Beat Superhuman Go AIs

    Authors: Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

    Abstract: We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human exper…

    Submitted 13 July, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted to ICML 2023, see paper for changelog

    ACM Class: I.2.6

  8. arXiv:2208.09570

    cs.LG

    Calculus on MDPs: Potential Shaping as a Gradient

    Authors: Erik Jenner, Herke van Hoof, Adam Gleave

    Abstract: In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce. A particularly well-known and important example is potential shaping, a class of functions that can be added to any reward function without changing the optimal policy set under arbitrary transition dynamics. Potential shaping is conceptually similar to potentials, conservative vec…

    Submitted 2 December, 2022; v1 submitted 19 August, 2022; originally announced August 2022.

    Comments: Fixed mistake in proof that affected several results

  9. arXiv:2208.05083

    cs.LG cs.AI cs.CR

    Reducing Exploitability with Population Based Training

    Authors: Pavel Czempin, Adam Gleave

    Abstract: Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games. Yet prior work has found that policies that are highly capable against regular opponents can fail catastrophically against adversarial policies: an opponent trained explicitly against the victim. Prior defenses using adversarial training were able to make the victim rob…

    Submitted 11 January, 2023; v1 submitted 9 August, 2022; originally announced August 2022.

    Comments: Presented at New Frontiers in Adversarial Machine Learning Workshop, ICML 2022

  10. arXiv:2203.13553

    cs.LG

    Preprocessing Reward Functions for Interpretability

    Authors: Erik Jenner, Adam Gleave

    Abstract: In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward func…

    Submitted 25 March, 2022; originally announced March 2022.

    Comments: Presented at the NeurIPS 2021 Cooperative AI workshop. Code available at https://github.com/HumanCompatibleAI/reward-preprocessing

  11. arXiv:2203.11409

    cs.LG cs.AI

    A Primer on Maximum Causal Entropy Inverse Reinforcement Learning

    Authors: Adam Gleave, Sam Toyer

    Abstract: Inverse Reinforcement Learning (IRL) algorithms infer a reward function that explains demonstrations provided by an expert acting in the environment. Maximum Causal Entropy (MCE) IRL is currently the most popular formulation of IRL, with numerous extensions. In this tutorial, we present a compressed derivation of MCE IRL and the key results from contemporary implementations of MCE IRL algorithms.…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: 29 pages

    ACM Class: I.2.6

  12. arXiv:2203.07475

    cs.LG cs.AI stat.ML

    Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

    Authors: Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave

    Abstract: It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally chara…

    Submitted 7 June, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: ICML 2023. 9 pages main paper, 26 pages total, 3 figures

    ACM Class: I.2.6

  13. arXiv:2203.07472

    cs.CL cs.AI cs.LG

    Uncertainty Estimation for Language Reward Models

    Authors: Adam Gleave, Geoffrey Irving

    Abstract: Language models can learn a range of capabilities from unsupervised training on text corpora. However, to solve a particular problem (such as text summarization) it is typically necessary to fine-tune them on a task-specific dataset. It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward mo…

    Submitted 14 March, 2022; originally announced March 2022.

    Comments: 8 pages main paper, 17 pages total

    ACM Class: I.2.7

  14. arXiv:2012.05862

    cs.LG

    Understanding Learned Reward Functions

    Authors: Eric J. Michaud, Adam Gleave, Stuart Russell

    Abstract: In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to…

    Submitted 10 December, 2020; originally announced December 2020.

    Comments: Presented at Deep RL Workshop, NeurIPS 2020

  15. arXiv:2012.01365

    cs.LG cs.AI

    DERAIL: Diagnostic Environments for Reward And Imitation Learning

    Authors: Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

    Abstract: The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complem…

    Submitted 2 December, 2020; originally announced December 2020.

  16. arXiv:2006.13900

    cs.LG cs.AI stat.ML

    Quantifying Differences in Reward Functions

    Authors: Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

    Abstract: For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimiza…

    Submitted 17 March, 2021; v1 submitted 24 June, 2020; originally announced June 2020.

    Comments: Published at ICLR 2021. 9 pages main paper, 42 pages total

    ACM Class: I.2.6

  17. arXiv:1905.10615

    cs.LG cs.AI cs.CR stat.ML

    Adversarial Policies: Attacking Deep Reinforcement Learning

    Authors: Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

    Abstract: Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers. However, an attacker is not usually able to directly modify another agent's observations. This might lead one to wonder: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environ…

    Submitted 17 January, 2021; v1 submitted 25 May, 2019; originally announced May 2019.

    Comments: Presented at ICLR 2020

    ACM Class: I.2.6

  18. arXiv:1810.10593

    cs.LG cs.AI stat.ML

    Inverse reinforcement learning for video games

    Authors: Aaron Tucker, Adam Gleave, Stuart Russell

    Abstract: Deep reinforcement learning achieves superhuman performance in a range of video game environments, but requires that a designer manually specify a reward function. It is often easier to provide demonstrations of a target behavior than to design a reward function describing that behavior. Inverse reinforcement learning (IRL) algorithms can infer a reward from demonstrations in low-dimensional conti…

    Submitted 24 October, 2018; originally announced October 2018.

    Comments: 10 pages, 4 figures. Submitted to NIPS Deep RL Workshop

    ACM Class: I.2.6

  19. arXiv:1809.03060

    cs.LG cs.AI stat.ML

    Active Inverse Reward Design

    Authors: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

    Abstract: Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true rewar…

    Submitted 6 November, 2019; v1 submitted 9 September, 2018; originally announced September 2018.

  20. arXiv:1805.08882

    cs.LG cs.AI stat.ML

    Multi-task Maximum Entropy Inverse Reinforcement Learning

    Authors: Adam Gleave, Oliver Habryka

    Abstract: Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations. Prior work, built on Bayesian IRL, is unable to scale to complex environments due to computational constraints. This paper contributes a formulation of multi-task IRL in the more computationally efficient Maximum Causal Entropy (MCE) IRL framework. Experiments show our…

    Submitted 15 July, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

    Comments: Presented at 1st Workshop on Goal Specifications for Reinforcement Learning (ICML/IJCAI/AAMAS 2018)

    ACM Class: I.2.6

  21. arXiv:1701.04047

    cs.IT

    Making compression algorithms for Unicode text

    Authors: Adam Gleave, Christian Steinruecken

    Abstract: The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes. While this approach works well for the single-byte ASCII encoding, it works poorly for UTF-8, where characters often span multiple bytes. Previous research has foc…

    Submitted 15 January, 2017; originally announced January 2017.

    Comments: 10 pages