Showing 1–50 of 97 results for author: Russell, S

  1. arXiv:2406.19501  [pdf, other]

    cs.CL cs.LG

    Monitoring Latent World States in Language Models with Propositional Probes

    Authors: Jiahai Feng, Stuart Russell, Jacob Steinhardt

    Abstract: Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. W…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.00877  [pdf, other]

    cs.LG cs.AI

    Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

    Authors: Erik Jenner, Shreyas Kapur, Vasil Georgiev, Cameron Allen, Scott Emmons, Stuart Russell

    Abstract: Do neural networks learn to implement algorithms such as look-ahead or search "in the wild"? Or do they rely purely on collections of simple heuristics? We present evidence of learned look-ahead in the policy network of Leela Chess Zero, the currently strongest neural chess engine. We find that Leela internally represents future optimal moves and that these representations are crucial for its fina…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Project page: https://leela-interp.github.io/

  3. arXiv:2405.20519  [pdf, other]

    cs.AI

    Diffusion On Syntax Trees For Program Synthesis

    Authors: Shreyas Kapur, Erik Jenner, Stuart Russell

    Abstract: Large language models generate code one token at a time. Their autoregressive generation process lacks the feedback of observing the program's output. Training LLMs to suggest edits directly can be challenging due to the scarcity of rich edit data. To address these problems, we propose neural diffusion models that operate on syntax trees of any context-free grammar. Similar to image diffusion mode…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: https://tree-diffusion.github.io

  4. arXiv:2405.17713  [pdf, other]

    cs.AI cs.LG

    AI Alignment with Changing and Influenceable Reward Functions

    Authors: Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan

    Abstract: Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI's influence on them.…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to ICML 2024

  5. arXiv:2405.06624  [pdf, other]

    cs.AI

    Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

    Authors: David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum

    Abstract: Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these appro…

    Submitted 17 May, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

  6. arXiv:2405.04669  [pdf, other]

    cs.LG cs.CL

    Towards a Theoretical Understanding of the 'Reversal Curse' via Training Dynamics

    Authors: Hanlin Zhu, Baihe Huang, Shaolun Zhang, Michael Jordan, Jiantao Jiao, Yuandong Tian, Stuart Russell

    Abstract: Auto-regressive large language models (LLMs) show impressive capacities to solve many complex reasoning tasks while struggling with some simple logical reasoning tasks such as inverse search: when trained on "A is B", LLM fails to directly conclude "B is A" during inference, which is known as the "reversal curse" (Berglund et al., 2023). In this paper, we theoretically analyze the reversal c…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 40 pages, 15 figures

  7. arXiv:2404.10271  [pdf, other]

    cs.LG cs.AI cs.CL cs.CY cs.GT

    Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback

    Authors: Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, William S. Zwicker

    Abstract: Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, such as helping to commit crimes or producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level prin…

    Submitted 4 June, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Comments: 15 pages, 4 figures

    MSC Class: 68T01; 68T50; 91B14; 91B12 ACM Class: I.2.0; I.2.7; K.4.2; I.2.m; J.4

  8. arXiv:2403.19107  [pdf]

    cs.CV cs.LG

    Synthetic Medical Imaging Generation with Generative Adversarial Networks For Plain Radiographs

    Authors: John R. McNulty, Lee Kho, Alexandria L. Case, Charlie Fornaca, Drew Johnston, David Slater, Joshua M. Abzug, Sybil A. Russell

    Abstract: In medical imaging, access to data is commonly limited due to patient privacy restrictions and the issue that it can be difficult to acquire enough data in the case of rare diseases.[1] The purpose of this investigation was to develop a reusable open-source synthetic image generation pipeline, the GAN Image Synthesis Tool (GIST), that is easy to use as well as easy to deploy. The pipeline helps to…

    Submitted 27 March, 2024; originally announced March 2024.

    Report number: Public Release Case Number 22-3965

  9. arXiv:2403.06003  [pdf, other]

    cs.RO cs.AI cs.LG

    A Generalized Acquisition Function for Preference-based Reward Learning

    Authors: Evan Ellis, Gaurav R. Ghosal, Stuart J. Russell, Anca Dragan, Erdem Bıyık

    Abstract: Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion focuses on precisely identifying all parameters of the rewa…

    Submitted 9 March, 2024; originally announced March 2024.

  10. arXiv:2402.17747  [pdf, other]

    cs.LG cs.AI stat.ML

    When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback

    Authors: Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

    Abstract: Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is…

    Submitted 8 June, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

  11. arXiv:2402.08062  [pdf, ps, other]

    cs.LG cs.AI

    Avoiding Catastrophe in Continuous Spaces by Asking for Help

    Authors: Benjamin Plaut, Hanlin Zhu, Stuart Russell

    Abstract: Most reinforcement learning algorithms with formal regret guarantees assume all mistakes are reversible and essentially rely on trying all possible behaviors. This approach leads to poor outcomes when some mistakes are irreparable or even catastrophic. We propose a variant of the contextual bandit problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the pay…

    Submitted 26 May, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  12. arXiv:2312.12747  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    ALMANACS: A Simulatability Benchmark for Language Model Explainability

    Authors: Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons

    Abstract: How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations impro…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Code is available at https://github.com/edmundmills/ALMANACS

  13. arXiv:2312.08369  [pdf, other]

    stat.ML cs.AI cs.LG

    The Effective Horizon Explains Deep RL Performance in Stochastic Environments

    Authors: Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

    Abstract: Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neu…

    Submitted 12 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

    Journal ref: ICLR 2024 (Spotlight)

  14. arXiv:2311.01011  [pdf, other]

    cs.LG cs.CR

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Authors: Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell

    Abstract: While Large Language Models (LLMs) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. To help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by p…

    Submitted 2 November, 2023; originally announced November 2023.

  15. arXiv:2310.17688  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Managing extreme AI risks amid rapid progress

    Authors: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

    Abstract: Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although rese…

    Submitted 22 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Published in Science: https://www.science.org/doi/10.1126/science.adn0117

  16. arXiv:2310.15288  [pdf, other]

    cs.AI cs.LG

    Active teacher selection for reinforcement learning from human feedback

    Authors: Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell

    Abstract: Reinforcement learning from human feedback (RLHF) enables machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite querying a range of distinct teachers. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costline…

    Submitted 23 October, 2023; originally announced October 2023.

  17. arXiv:2310.01706  [pdf, other]

    cs.LG

    On Representation Complexity of Model-based and Model-free Reinforcement Learning

    Authors: Hanlin Zhu, Baihe Huang, Stuart Russell

    Abstract: We study the representation complexity of model-based and model-free reinforcement learning (RL) in the context of circuit complexity. We prove theoretically that there exists a broad class of MDPs such that their underlying transition and reward functions can be represented by constant depth circuits with polynomial size, while the optimal $Q$-function suffers an exponential circuit complexity in…

    Submitted 10 March, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: 23 pages, 9 figures, to be published in ICLR 2024

  18. arXiv:2309.00236  [pdf, other]

    cs.LG cs.CL cs.CR

    Image Hijacks: Adversarial Images can Control Generative Models at Runtime

    Authors: Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons

    Abstract: Are foundation models secure against malicious actors? In this work, we focus on the image input to a vision-language model (VLM). We discover image hijacks, adversarial images that control the behaviour of VLMs at inference time, and introduce the general Behaviour Matching algorithm for training image hijacks. From this, we derive the Prompt Matching method, allowing us to train hijacks matching…

    Submitted 22 April, 2024; v1 submitted 31 August, 2023; originally announced September 2023.

    Comments: Project page at https://image-hijacks.github.io

  19. arXiv:2307.14745  [pdf, other]

    cs.MA

    Using Multi-Agent MicroServices (MAMS) for Agent Based Modelling

    Authors: Martynas Jagutis, Sean Russell, Rem Collier

    Abstract: This paper demonstrates the use of the Multi-Agent MicroServices (MAMS) architectural style through a case study based around the development of a prototype traffic simulation in which agents model a population of individuals who travel from home to work and vice versa by car.

    Submitted 27 July, 2023; originally announced July 2023.

    Comments: 4-page demo paper accepted at EMAS. The paper has been extended from this version and submitted for publication in the formal proceedings

  20. arXiv:2306.09309  [pdf, other]

    cs.AI cs.MA

    Who Needs to Know? Minimal Knowledge for Optimal Coordination

    Authors: Niklas Lauffer, Ameesh Shah, Micah Carroll, Michael Dennis, Stuart Russell

    Abstract: To optimally coordinate with others in cooperative games, it is often crucial to have information about one's collaborators: successful driving requires understanding which side of the road to drive on. However, not every feature of collaborators is strategically relevant: the fine-grained acceleration of drivers may be ignored while maintaining optimal coordination. We show that there is a well-d…

    Submitted 13 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: To be published at ICML 2023

    ACM Class: I.2.6; I.2.11

  21. arXiv:2306.06924  [pdf, other]

    cs.AI cs.CR cs.CY cs.LG

    TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI

    Authors: Andrew Critch, Stuart Russell

    Abstract: While several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an exhaustive taxonomy of such risks. Many exhaustive taxonomies are possible, and some are useful -- particularly if they reveal new risks or practical approaches to safety. This paper explores a taxonomy based on accountability: whose act…

    Submitted 14 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    MSC Class: 68T01 ACM Class: I.2.0

  22. arXiv:2304.09853  [pdf, other]

    cs.LG stat.ML

    Bridging RL Theory and Practice with the Effective Horizon

    Authors: Cassidy Laidlaw, Stuart Russell, Anca Dragan

    Abstract: Deep reinforcement learning (RL) works impressively in some environments and fails catastrophically in others. Ideally, RL theory should be able to provide an understanding of why this is, i.e. bounds predictive of practical performance. Unfortunately, current theory does not quite have this ability. We compare standard deep RL algorithms to prior sample complexity bounds by introducing a new data…

    Submitted 11 January, 2024; v1 submitted 19 April, 2023; originally announced April 2023.

    Journal ref: NeurIPS 2023 (Oral)

  23. arXiv:2303.00894  [pdf, other]

    cs.LG cs.AI

    Active Reward Learning from Multiple Teachers

    Authors: Peter Barnett, Rachel Freedman, Justin Svegliato, Stuart Russell

    Abstract: Reward learning algorithms utilize human feedback to infer a reward function, which is then used to train an AI system. This human feedback is often a preference comparison, in which the human teacher compares several samples of AI behavior and chooses which they believe best accomplishes the objective. While reward learning typically assumes that all feedback comes from a single teacher, in pract…

    Submitted 1 March, 2023; originally announced March 2023.

  24. arXiv:2211.11972  [pdf, other]

    cs.LG cs.AI

    imitation: Clean Imitation Learning Implementations

    Authors: Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell

    Abstract: imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch. We include three inverse reinforcement learning (IRL) algorithms, three imitation learning algorithms and a preference comparison algorithm. The implementations have been benchmarked against previous results, and automated tests cover 98% of the code. Moreover, the algorithms are implemented in a…

    Submitted 21 November, 2022; originally announced November 2022.

  25. arXiv:2211.00716  [pdf, ps, other]

    cs.LG cs.AI math.OC math.ST stat.ML

    Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian

    Authors: Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, Jiantao Jiao

    Abstract: Offline reinforcement learning (RL), which refers to decision-making from a previously-collected dataset of interactions, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have fi…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: 49 pages, 1 figure

  26. arXiv:2211.00241  [pdf, other]

    cs.LG cs.AI cs.CR stat.ML

    Adversarial Policies Beat Superhuman Go AIs

    Authors: Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

    Abstract: We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human exper…

    Submitted 13 July, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted to ICML 2023, see paper for changelog

    ACM Class: I.2.6

  27. arXiv:2208.07006  [pdf, ps, other]

    cs.GT cs.LO cs.MA

    Cooperative and uncooperative institution designs: Surprises and problems in open-source game theory

    Authors: Andrew Critch, Michael Dennis, Stuart Russell

    Abstract: It is increasingly possible for real-world agents, such as software-based agents or human institutions, to view the internal programming of other such agents that they interact with. For instance, a company can read the bylaws of another company, or one software system can read the source code of another. Game-theoretic equilibria between the designers of such agents are called program equil…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: 41 pages

    MSC Class: 93A14; 93A16; 91-08; 91A11; 91A35; 91A68; 91A44; 91B06; 91B41; 91B52 ACM Class: F.3.1; F.4.1; I.2.3; J.4

  28. arXiv:2207.03470  [pdf, other]

    cs.GT cs.AI cs.LG cs.MA

    For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria

    Authors: Scott Emmons, Caspar Oesterheld, Andrew Critch, Vincent Conitzer, Stuart Russell

    Abstract: Although it has been known since the 1970s that a globally optimal strategy profile in a common-payoff game is a Nash equilibrium, global optimality is a strict requirement that limits the result's applicability. In this work, we show that any locally optimal symmetric strategy profile is also a (global) Nash equilibrium. Furthermore, we show that this result is robust to perturbations to the comm…

    Submitted 7 July, 2022; originally announced July 2022.

  29. arXiv:2205.07886  [pdf, other]

    cs.LG cs.AI

    An Empirical Investigation of Representation Learning for Imitation

    Authors: Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

    Abstract: Imitation learning often needs a large demonstration set in order to handle the full range of situations that an agent might find itself in during deployment. However, collecting expert demonstrations can be expensive. Recent work in vision, reinforcement learning, and NLP has shown that auxiliary representation learning objectives can reduce the need for large amounts of expensive, task-specific…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: Accepted to NeurIPS 2021 Datasets and Benchmarks Track

  30. arXiv:2204.11966  [pdf, other]

    cs.LG cs.IR

    Estimating and Penalizing Induced Preference Shifts in Recommender Systems

    Authors: Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell

    Abstract: The content that a recommender system (RS) shows to users influences them. Therefore, when choosing a recommender to deploy, one is implicitly also choosing to induce specific internal states in users. Even more, systems trained via long-horizon optimization will have direct incentives to manipulate users: in this work, we focus on the incentive to shift user preferences so they are easier to sati…

    Submitted 14 July, 2022; v1 submitted 25 April, 2022; originally announced April 2022.

    Comments: Accepted to ICML 2022 (Spotlight)

    Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:2686-2708, 2022

  31. arXiv:2203.12053  [pdf, other]

    eess.AS cs.SD

    Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content

    Authors: Haici Yang, Sanna Wager, Spencer Russell, Mike Luo, Minje Kim, Wontak Kim

    Abstract: In the stereo-to-multichannel upmixing problem for music, one of the main tasks is to set the directionality of the instrument sources in the multichannel rendering results. In this paper, we propose a modified variational autoencoder model that learns a latent space to describe the spatial images in multichannel music. We seek to disentangle the spatial images and music content, so the learned la…

    Submitted 22 March, 2022; originally announced March 2022.

  32. arXiv:2203.07475  [pdf, other]

    cs.LG cs.AI stat.ML

    Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

    Authors: Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave

    Abstract: It is often very challenging to manually design reward functions for complex, real-world tasks. To solve this, one can instead use reward learning to infer a reward function from data. However, there are often multiple reward functions that fit the data equally well, even in the infinite-data limit. This means that the reward function is only partially identifiable. In this work, we formally chara…

    Submitted 7 June, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: ICML 2023. 9 pages main paper, 26 pages total, 3 figures

    ACM Class: I.2.6

  33. arXiv:2110.08058  [pdf, other]

    cs.LG cs.AI cs.NE

    Quantifying Local Specialization in Deep Neural Networks

    Authors: Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell

    Abstract: A neural network is locally specialized to the extent that parts of its computational graph (i.e. structure) can be abstractly represented as performing some comprehensible sub-task relevant to the overall task (i.e. functionality). Are modern deep neural networks locally specialized? How can this be quantified? In this paper, we consider the problem of taking a neural network whose neurons are pa…

    Submitted 7 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: 21 pages, 6 figures. Code is available at https://github.com/thestephencasper/detecting_nn_modularity

  34. arXiv:2110.03684  [pdf, other]

    cs.LG cs.AI cs.RO stat.ML

    Cross-Domain Imitation Learning via Optimal Transport

    Authors: Arnaud Fickinger, Samuel Cohen, Stuart Russell, Brandon Amos

    Abstract: Cross-domain imitation learning studies how to leverage expert demonstrations of one agent to train an imitation agent with a different embodiment or morphology. Comparing trajectories and stationary distributions between the expert and imitation agents is challenging because they live on different systems that may not even have the same dimensionality. We propose Gromov-Wasserstein Imitation Lear…

    Submitted 25 April, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: ICLR 2022

  35. arXiv:2109.15316  [pdf, other]

    cs.AI

    Scalable Online Planning via Reinforcement Learning Fine-Tuning

    Authors: Arnaud Fickinger, Hengyuan Hu, Brandon Amos, Stuart Russell, Noam Brown

    Abstract: Lookahead search has been a critical component of recent AI successes, such as in the games of chess, go, and poker. However, the search methods used in these games, and in many other settings, are tabular. Tabular search methods do not scale well with the size of the search space, and this problem is exacerbated by stochasticity and partial observability. In this work we replace tabular search wi…

    Submitted 30 September, 2021; originally announced September 2021.

  36. arXiv:2107.07394  [pdf, other]

    cs.LG cs.AI

    Explore and Control with Adversarial Surprise

    Authors: Arnaud Fickinger, Natasha Jaques, Samyak Parajuli, Michael Chang, Nicholas Rhinehart, Glen Berseth, Stuart Russell, Sergey Levine

    Abstract: Unsupervised reinforcement learning (RL) studies how to leverage environment statistics to learn useful behaviors without the cost of reward engineering. However, a central challenge in unsupervised RL is to extract behaviors that meaningfully affect the world and cover the range of possible outcomes, without getting distracted by inherently unpredictable, uncontrollable, and stochastic elements i…

    Submitted 28 December, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

  37. arXiv:2107.01969  [pdf, other]

    cs.LG cs.AI

    The MineRL BASALT Competition on Learning from Human Feedback

    Authors: Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan

    Abstract: The last decade has seen a significant increase of interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-defined specification? While multiple solutions have…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: NeurIPS 2021 Competition Track

  38. arXiv:2106.10394  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    Uncertain Decisions Facilitate Better Preference Learning

    Authors: Cassidy Laidlaw, Stuart Russell

    Abstract: Existing observational approaches for learning human preferences, such as inverse reinforcement learning, usually make strong assumptions about the observability of the human's environment. However, in reality, people make many important decisions under uncertainty. To better understand preference learning in these cases, we study the setting of inverse decision theory (IDT), a previously proposed…

    Submitted 28 October, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: Accepted at NeurIPS 2021 (Spotlight)

  39. arXiv:2106.10268  [pdf, other]

    cs.LG cs.AI stat.ML

    MADE: Exploration via Maximizing Deviation from Explored Regions

    Authors: Tianjun Zhang, Paria Rashidinejad, Jiantao Jiao, Yuandong Tian, Joseph Gonzalez, Stuart Russell

    Abstract: In online reinforcement learning (RL), efficient exploration remains particularly challenging in high-dimensional environments with sparse rewards. In low-dimensional environments, where tabular parameterization is possible, count-based upper confidence bound (UCB) exploration methods achieve minimax near-optimal rates. However, it remains unclear how to efficiently implement UCB in realistic RL t…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: 28 pages, 10 figures

  40. arXiv:2103.12021  [pdf, other]

    cs.LG cs.AI math.OC math.ST stat.ML

    Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

    Authors: Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, Stuart Russell

    Abstract: Offline (or batch) reinforcement learning (RL) algorithms seek to learn an optimal policy from a fixed dataset without active data collection. Based on the composition of the offline dataset, two main categories of methods are used: imitation learning which is suitable for expert datasets and vanilla offline RL which often requires uniform coverage datasets. From a practical standpoint, datasets o…

    Submitted 3 July, 2023; v1 submitted 22 March, 2021; originally announced March 2021.

    Journal ref: Published at NeurIPS 2021 and IEEE Transactions on Information Theory

  41. arXiv:2103.03386  [pdf, other]

    cs.NE

    Clusterability in Neural Networks

    Authors: Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

    Abstract: The learned weights of a neural network have often been considered devoid of scrutable internal structure. In this paper, however, we look for structure in the form of clusterability: how well a network can be divided into groups of neurons with strong internal connectivity but weak external connectivity. We find that a trained neural network is typically more clusterable than randomly initialized…

    Submitted 4 March, 2021; originally announced March 2021.

    Comments: 20 pages, 22 figures. arXiv admin note: text overlap with arXiv:2003.04881

  42. arXiv:2101.10305  [pdf, other]

    cs.MA cs.AI

    Accumulating Risk Capital Through Investing in Cooperation

    Authors: Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell

    Abstract: Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors. We show that this is an unavoidable trade-off and propose an objective which balances these concerns, promoting both safety and long-term cooperation. Moreover, the trade-off between safety and…

    Submitted 20 April, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

  43. arXiv:2012.14536  [pdf, other]

    cs.GT cs.AI

    Multi-Principal Assistance Games: Definition and Collegial Mechanisms

    Authors: Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell

    Abstract: We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism. In an MPAG, a single agent assists N human principals who may have widely different preferences. MPAGs generalize assistance games, also known as cooperative inverse reinforcement learning game…

    Submitted 28 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: text overlap with arXiv:2007.09540

  44. arXiv:2012.05862  [pdf, other]

    cs.LG

    Understanding Learned Reward Functions

    Authors: Eric J. Michaud, Adam Gleave, Stuart Russell

    Abstract: In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to…

    Submitted 10 December, 2020; originally announced December 2020.

    Comments: Presented at Deep RL Workshop, NeurIPS 2020

  45. arXiv:2012.02096  [pdf, other]

    cs.LG cs.AI cs.MA

    Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

    Authors: Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine

    Abstract: A wide range of reinforcement learning (RL) problems - including robustness, transfer learning, unsupervised RL, and emergent complexity - require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error prone, and takes a significant amount of developer time and effort. We propose Unsupervised Environmen…

    Submitted 3 February, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

  46. arXiv:2012.01365  [pdf, other]

    cs.LG cs.AI

    DERAIL: Diagnostic Environments for Reward And Imitation Learning

    Authors: Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

    Abstract: The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or imitation learning algorithms to infer a reward or policy directly from human data. Existing benchmarks for these algorithms focus on realism, testing in complex environments. Unfortunately, these benchmarks are slow, unreliable and cannot isolate failures. As a complem…

    Submitted 2 December, 2020; originally announced December 2020.

  47. arXiv:2011.00401  [pdf, other]

    cs.LG cs.AI

    The MAGICAL Benchmark for Robust Imitation

    Authors: Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

    Abstract: Imitation Learning (IL) algorithms are typically evaluated in the same environment that was used to create demonstrations. This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings. This paper presents the MAGICAL benchmark…

    Submitted 31 October, 2020; originally announced November 2020.

    Comments: NeurIPS 2020 conference paper (poster)

  48. arXiv:2010.05899  [pdf, other]

    cs.LG eess.SY math.OC math.ST stat.ML

    SLIP: Learning to Predict in Unknown Dynamical Systems with Long-Term Memory

    Authors: Paria Rashidinejad, Jiantao Jiao, Stuart Russell

    Abstract: We present an efficient and practical (polynomial time) algorithm for online prediction in unknown and partially observed linear dynamical systems (LDS) under stochastic noise. When the system parameters are known, the optimal linear predictor is the Kalman filter. However, the performance of existing predictive models is poor in important classes of LDS that are only marginally stable and exhibit…

    Submitted 12 October, 2020; originally announced October 2020.

    Comments: 47 pages, 3 figures

  49. arXiv:2007.09540  [pdf, other]

    cs.AI

    Multi-Principal Assistance Games

    Authors: Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell

    Abstract: Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function. This paper studies multi-principal assistance games, which cover the more general case in which the robot acts on behalf of N humans who may hav…

    Submitted 18 July, 2020; originally announced July 2020.

  50. arXiv:2006.13900  [pdf, other]

    cs.LG cs.AI stat.ML

    Quantifying Differences in Reward Functions

    Authors: Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

    Abstract: For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimiza…

    Submitted 17 March, 2021; v1 submitted 24 June, 2020; originally announced June 2020.

    Comments: Published at ICLR 2021. 9 pages main paper, 42 pages total

    ACM Class: I.2.6