Showing 1–11 of 11 results for author: Toyer, S

  1. arXiv:2402.10260  [pdf, other]

    cs.LG cs.CL cs.CR

    A StrongREJECT for Empty Jailbreaks

    Authors: Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer

    Abstract: The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased tow…

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: Code and data at https://github.com/alexandrasouly/strongreject

  2. arXiv:2311.01011  [pdf, other]

    cs.LG cs.CR

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell

    Abstract: While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by p…

    Submitted 2 November, 2023; originally announced November 2023.

  3. arXiv:2211.11972  [pdf, other]

    cs.LG cs.AI

    imitation: Clean Imitation Learning Implementations

    Authors: Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell

    Abstract: imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch. We include three inverse reinforcement learning (IRL) algorithms, three imitation learning algorithms and a preference comparison algorithm. The implementations have been benchmarked against previous results, and automated tests cover 98% of the code. Moreover, the algorithms are implemented in a…

    Submitted 21 November, 2022; originally announced November 2022.

  4. arXiv:2205.07886  [pdf, other]

    cs.LG cs.AI

    An Empirical Investigation of Representation Learning for Imitation

    Authors: Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, ** Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

    Abstract: Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: Accepted to NeurIPS 2021 Datasets and Benchmarks Track

  5. arXiv:2203.11409  [pdf, other]

    cs.LG cs.AI

    A Primer on Maximum Causal Entropy Inverse Reinforcement Learning

    Authors: Adam Gleave, Sam Toyer

    Abstract: Inverse Reinforcement Learning (IRL) algorithms infer a reward function that explains demonstrations provided by an expert acting in the environment. Maximum Causal Entropy (MCE) IRL is currently the most popular formulation of IRL, with numerous extensions. In this tutorial, we present a compressed derivation of MCE IRL and the key results from contemporary implementations of MCE IRL algorithms.…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: 29 pages

    ACM Class: I.2.6

  6. arXiv:2012.01365  [pdf, other]

    cs.LG cs.AI

    DERAIL: Diagnostic Environments for Reward And Imitation Learning

    Authors: Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

    Abstract: The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complem…

    Submitted 2 December, 2020; originally announced December 2020.

  7. arXiv:2011.00401  [pdf, other]

    cs.LG cs.AI

    The MAGICAL Benchmark for Robust Imitation

    Authors: Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

    Abstract: Imitation Learning (IL) algorithms are typically evaluated in the same environment that was used to create demonstrations. This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings. This paper presents the MAGICAL benchmark…

    Submitted 31 October, 2020; originally announced November 2020.

    Comments: NeurIPS 2020 conference paper (poster)

  8. ASNets: Deep Learning for Generalised Planning

    Authors: Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, Lexing Xie

    Abstract: In this paper, we discuss the learning of generalised policies for probabilistic and classical planning problems using Action Schema Networks (ASNets). The ASNet is a neural network architecture that exploits the relational structure of (P)PDDL planning problems to learn a common set of weights that can be applied to any problem in a domain. By mimicking the actions chosen by a traditional, non-le…

    Submitted 5 May, 2020; v1 submitted 4 August, 2019; originally announced August 2019.

    Comments: Journal extension of AAAI'18 paper (arXiv:1709.04271)

    Journal ref: Journal of Artificial Intelligence Research 68 (2020) 1-68

  9. arXiv:1810.00821  [pdf, other]

    cs.LG stat.ML

    Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

    Authors: Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine

    Abstract: Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to c…

    Submitted 24 August, 2020; v1 submitted 1 October, 2018; originally announced October 2018.

  10. arXiv:1709.04271  [pdf, other]

    cs.AI cs.LG

    Action Schema Networks: Generalised Policies with Deep Learning

    Authors: Sam Toyer, Felipe Trevizan, Sylvie Thiébaux, Lexing Xie

    Abstract: In this paper, we introduce the Action Schema Network (ASNet): a neural network architecture for learning generalised policies for probabilistic planning problems. By mimicking the relational structure of planning problems, ASNets are able to adopt a weight-sharing scheme which allows the network to be applied to any problem from a given planning domain. This allows the cost of training the networ…

    Submitted 22 December, 2017; v1 submitted 13 September, 2017; originally announced September 2017.

    Comments: Accepted to AAAI 2018

  11. arXiv:1707.09240  [pdf, other]

    cs.CV cs.LG

    Human Pose Forecasting via Deep Markov Models

    Authors: Sam Toyer, Anoop Cherian, Tengda Han, Stephen Gould

    Abstract: Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving. Usually, forecasting algorithms use 3D skeleton sequences and are trained to forecast for a few milliseconds into the future. Long-range forecasting is challenging due to the difficulty of estimating how long a person continues an activity. To…

    Submitted 5 September, 2017; v1 submitted 24 July, 2017; originally announced July 2017.

    Comments: Accepted to DICTA'17