-
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Authors:
Erik Jenner,
Shreyas Kapur,
Vasil Georgiev,
Cameron Allen,
Scott Emmons,
Stuart Russell
Abstract:
Do neural networks learn to implement algorithms such as look-ahead or search "in the wild"? Or do they rely purely on collections of simple heuristics? We present evidence of learned look-ahead in the policy network of Leela Chess Zero, the currently strongest neural chess engine. We find that Leela internally represents future optimal moves and that these representations are crucial for its final output in certain board states. Concretely, we exploit the fact that Leela is a transformer that treats every chessboard square like a token in language models, and give three lines of evidence: (1) activations on certain squares of future moves are unusually important causally; (2) we find attention heads that move important information "forward and backward in time," e.g., from squares of future moves to squares of earlier ones; and (3) we train a simple probe that can predict the optimal move 2 turns ahead with 92% accuracy (in board states where Leela finds a single best line). These findings are an existence proof of learned look-ahead in neural networks and might be a step towards a better understanding of their capabilities.
Submitted 2 June, 2024;
originally announced June 2024.
-
Diffusion On Syntax Trees For Program Synthesis
Authors:
Shreyas Kapur,
Erik Jenner,
Stuart Russell
Abstract:
Large language models generate code one token at a time. Their autoregressive generation process lacks the feedback of observing the program's output. Training LLMs to suggest edits directly can be challenging due to the scarcity of rich edit data. To address these problems, we propose neural diffusion models that operate on syntax trees of any context-free grammar. Similar to image diffusion models, our method also inverts "noise" applied to syntax trees. Rather than generating code sequentially, we iteratively edit it while preserving syntactic validity, which makes it easy to combine this neural model with search. We apply our approach to inverse graphics tasks, where our model learns to convert images into programs that produce those images. Combined with search, our model is able to write graphics programs, see the execution result, and debug them to meet the required specifications. We additionally show how our system can write graphics programs for hand-drawn sketches.
Submitted 30 May, 2024;
originally announced May 2024.
-
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Authors:
Usman Anwar,
Abulhair Saparov,
Javier Rando,
Daniel Paleka,
Miles Turpin,
Peter Hase,
Ekdeep Singh Lubana,
Erik Jenner,
Stephen Casper,
Oliver Sourbut,
Benjamin L. Edelman,
Zhaowei Zhang,
Mario Günther,
Anton Korinek,
Jose Hernandez-Orallo,
Lewis Hammond,
Eric Bigelow,
Alexander Pan,
Lauro Langosco,
Tomasz Korbak,
Heidi Zhang,
Ruiqi Zhong,
Seán Ó hÉigeartaigh,
Gabriel Recchia,
Giulio Corsi, et al. (13 additional authors not shown)
Abstract:
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
Submitted 15 April, 2024;
originally announced April 2024.
-
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback
Authors:
Leon Lang,
Davis Foote,
Stuart Russell,
Anca Dragan,
Erik Jenner,
Scott Emmons
Abstract:
Past analyses of reinforcement learning from human feedback (RLHF) assume that the human evaluators fully observe the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deceptive inflation and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. Under the new assumption that the human's partial observability is known and accounted for, we then analyze how much information the feedback process provides about the return function. We show that sometimes, the human's feedback determines the return function uniquely up to an additive constant, but in other realistic cases, there is irreducible ambiguity. We propose exploratory research directions to help tackle these challenges and caution against blindly applying RLHF in partially observable settings.
Submitted 8 June, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
STARC: A General Framework For Quantifying Differences Between Reward Functions
Authors:
Joar Skalse,
Lucy Farnik,
Sumeet Ramesh Motwani,
Erik Jenner,
Adam Gleave,
Alessandro Abate
Abstract:
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
Submitted 11 March, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
imitation: Clean Imitation Learning Implementations
Authors:
Adam Gleave,
Mohammad Taufeeque,
Juan Rocamonde,
Erik Jenner,
Steven H. Wang,
Sam Toyer,
Maximilian Ernestus,
Nora Belrose,
Scott Emmons,
Stuart Russell
Abstract:
imitation provides open-source implementations of imitation and reward learning algorithms in PyTorch. We include three inverse reinforcement learning (IRL) algorithms, three imitation learning algorithms and a preference comparison algorithm. The implementations have been benchmarked against previous results, and automated tests cover 98% of the code. Moreover, the algorithms are implemented in a modular fashion, making it simple to develop novel algorithms in the framework. Our source code, including documentation and examples, is available at https://github.com/HumanCompatibleAI/imitation
Submitted 21 November, 2022;
originally announced November 2022.
-
Calculus on MDPs: Potential Shaping as a Gradient
Authors:
Erik Jenner,
Herke van Hoof,
Adam Gleave
Abstract:
In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce. A particularly well-known and important example is potential shaping, a class of functions that can be added to any reward function without changing the optimal policy set under arbitrary transition dynamics. Potential shaping is conceptually similar to potentials, conservative vector fields and gauge transformations in math and physics, but this connection has not previously been formally explored. We develop a formalism for discrete calculus on graphs that abstract a Markov Decision Process, and show how potential shaping can be formally interpreted as a gradient within this framework. This allows us to strengthen results from Ng et al. (1999) describing conditions under which potential shaping is the only additive reward transformation to always preserve optimal policies. As an additional application of our formalism, we define a rule for picking a single unique reward function from each potential shaping equivalence class.
Submitted 2 December, 2022; v1 submitted 19 August, 2022;
originally announced August 2022.
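For readers unfamiliar with the term, the standard form of potential shaping from Ng et al. (1999), which the abstract above builds on, can be written as follows. The symbols $\Phi$ (a real-valued potential on states), $\gamma$ (the discount factor), and $R'$ (the shaped reward) are notation chosen for this note, not taken from the listing itself.

```latex
% Potential shaping term for a transition (s, a, s'):
F(s, a, s') = \gamma \, \Phi(s') - \Phi(s)

% Shaped reward: the original reward plus the shaping term.
R'(s, a, s') = R(s, a, s') + F(s, a, s')
```

Because $F$ is a discounted difference of a state potential, it resembles a discrete gradient of $\Phi$ along a transition, which is the intuition the abstract says the paper makes formal.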
-
Preprocessing Reward Functions for Interpretability
Authors:
Erik Jenner,
Adam Gleave
Abstract:
In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward function to spot potential deviations from the user's intention. Existing work has applied general-purpose interpretability tools to understand learned reward functions. We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized. We introduce a general framework for such reward preprocessing and propose concrete preprocessing algorithms. Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
Submitted 25 March, 2022;
originally announced March 2022.
-
Extensions of Karger's Algorithm: Why They Fail in Theory and How They Are Useful in Practice
Authors:
Erik Jenner,
Enrique Fita Sanmartín,
Fred A. Hamprecht
Abstract:
The minimum graph cut and minimum $s$-$t$-cut problems are important primitives in the modeling of combinatorial problems in computer science, including in computer vision and machine learning. Some of the most efficient algorithms for finding global minimum cuts are randomized algorithms based on Karger's groundbreaking contraction algorithm. Here, we study whether Karger's algorithm can be successfully generalized to other cut problems. We first prove that a wide class of natural generalizations of Karger's algorithm cannot efficiently solve the $s$-$t$-mincut or the normalized cut problem to optimality. However, we then present a simple new algorithm for seeded segmentation / graph-based semi-supervised learning that is closely based on Karger's original algorithm, showing that for these problems, extensions of Karger's algorithm can be useful. The new algorithm has linear asymptotic runtime and yields a potential that can be interpreted as the posterior probability of a sample belonging to a given seed / class. We clarify its relation to the random walker algorithm / harmonic energy minimization in terms of distributions over spanning forests. On classical problems from seeded image segmentation and graph-based semi-supervised learning on image data, the method performs at least as well as the random walker / harmonic energy minimization / Gaussian processes.
Submitted 16 December, 2021; v1 submitted 5 October, 2021;
originally announced October 2021.
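As background for the entry above, here is a minimal sketch of Karger's original contraction algorithm for the global minimum cut of an unweighted graph. It is not the paper's seeded-segmentation extension. The function names are hypothetical, and the sketch assumes the common implementation trick of processing edges in a uniformly random order with a union-find structure, which is equivalent to repeatedly contracting a random remaining edge.

```python
import random


def karger_contract(edges, num_vertices, rng):
    """Run one random contraction pass and return the size of the cut it finds."""
    parent = list(range(num_vertices))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = num_vertices
    order = list(edges)
    rng.shuffle(order)  # random edge order == random sequence of contractions
    for u, v in order:
        if remaining == 2:
            break
        ru, rv = find(u), find(v)
        if ru != rv:  # edges inside a super-vertex are self-loops; skip them
            parent[ru] = rv
            remaining -= 1
    # Edges whose endpoints lie in different super-vertices cross the cut.
    return sum(1 for u, v in edges if find(u) != find(v))


def karger_min_cut(edges, num_vertices, trials=200, seed=0):
    """Repeat the randomized contraction and keep the smallest cut seen."""
    rng = random.Random(seed)
    return min(karger_contract(edges, num_vertices, rng) for _ in range(trials))


# Two triangles joined by one bridge edge (2, 3).
bridge_graph = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
# A 4-cycle: every cut found by contraction has exactly two crossing edges.
cycle_graph = [(0, 1), (1, 2), (2, 3), (3, 0)]
```

A single pass only finds a minimum cut with some probability, so the number of `trials` trades runtime for confidence; the default of 200 is an arbitrary choice for small graphs.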
-
Steerable Partial Differential Operators for Equivariant Neural Networks
Authors:
Erik Jenner,
Maurice Weiler
Abstract:
Recent work in equivariant deep learning bears strong similarities to physics. Fields over a base space are fundamental entities in both subjects, as are equivariant maps between these fields. In deep learning, however, these maps are usually defined by convolutions with a kernel, whereas they are partial differential operators (PDOs) in physics. Developing the theory of equivariant PDOs in the context of deep learning could bring these subjects even closer together and lead to a stronger flow of ideas. In this work, we derive a $G$-steerability constraint that completely characterizes when a PDO between feature vector fields is equivariant, for arbitrary symmetry groups $G$. We then fully solve this constraint for several important groups. We use our solutions as equivariant drop-in replacements for convolutional layers and benchmark them in that role. Finally, we develop a framework for equivariant maps based on Schwartz distributions that unifies classical convolutions and differential operators and gives insight about the relation between the two.
Submitted 23 April, 2022; v1 submitted 18 June, 2021;
originally announced June 2021.
-
Stability of Superhydrophobic Ring & Axle Liquid Bearings
Authors:
Elliot Jenner,
Brian D'Urso
Abstract:
Friction between contacting solid surfaces is a dominant force on the micro-scale and a major consideration in the design of MEMS. Non-contact fluid bearings have been investigated as a way to mitigate this issue. Here we discuss a new design for surface tension-supported thrust bearings utilizing patterned superhydrophobic surfaces to achieve improved drag reduction. We examine sources of instability in the design, and demonstrate that it can be simply modeled and has superior stiffness as compared to other designs.
Submitted 24 June, 2015;
originally announced June 2015.
-
Absolute Measurement Of Laminar Shear Rate Using Photon Correlation Spectroscopy
Authors:
Elliot Jenner,
Brian D'Urso
Abstract:
An absolute measurement of the components of the shear rate tensor $\mathcal{S}$ in a fluid can be found by measuring the photon correlation function of light scattered from particles in the fluid. Previous methods of measuring $\mathcal{S}$ involve reading the velocity at various points and extrapolating the shear, which can be time consuming and is limited in its ability to examine small spatial scale or short time events. Previous work in Photon Correlation Spectroscopy has involved only approximate solutions, requiring free parameters to be scaled by a known case, or different cases, such as 2-D flows, but here we present a treatment that provides quantitative results directly and without calibration for full 3-D flow. We demonstrate this treatment experimentally with a cone and plate rheometer.
Submitted 11 May, 2015; v1 submitted 8 May, 2015;
originally announced May 2015.
-
Large Drag Reduction over Superhydrophobic Riblets
Authors:
Charlotte Barbier,
Elliot Jenner,
Brian D'Urso
Abstract:
Riblets and superhydrophobic surfaces are two demonstrated passive drag reduction techniques. We describe a method to fabricate surfaces that combine both of these techniques in order to increase drag reduction properties. Samples have been tested with a cone-and-plate rheometer system, and have demonstrated significant drag reduction even in the transitional-turbulent regime. Direct Numerical Simulations have been performed in order to estimate the equivalent slip length at higher rotational speed. The sample with 100 μm deep grooves performed very well, showing drag reduction varying from 15 to 20% over the whole range of flow conditions tested, and its slip length was estimated to be over 100 μm.
Submitted 2 June, 2014;
originally announced June 2014.