-
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Authors:
Edmund Mills,
Shiye Su,
Stuart Russell,
Scott Emmons
Abstract:
How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. We use ALMANACS to evaluate counterfactuals, rationalizations, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.
Submitted 19 December, 2023;
originally announced December 2023.
-
Stochastic Scaling in Loss Functions for Physics-Informed Neural Networks
Authors:
Ethan Mills,
Alexey Pozdnyakov
Abstract:
Differential equations are used in a wide variety of disciplines, describing the complex behavior of the physical world. Analytic solutions to these equations are often difficult to obtain, limiting our current ability to solve complex differential equations and necessitating sophisticated numerical methods to approximate solutions. Trained neural networks act as universal function approximators, able to numerically solve differential equations in a novel way. In this work, methods and applications of neural network algorithms for numerically solving differential equations are explored, with an emphasis on varying loss functions and biological applications. Variations on traditional loss functions and training parameters show promise in making neural network-aided solutions more efficient, allowing for the investigation of more complex equations governing biological principles.
Submitted 7 August, 2022;
originally announced August 2022.
-
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Authors:
Rohin Shah,
Steven H. Wang,
Cody Wild,
Stephanie Milani,
Anssi Kanervisto,
Vinicius G. Goecks,
Nicholas Waytowich,
David Watkins-Valls,
Bharat Prakash,
Edmund Mills,
Divyansh Garg,
Alexander Fries,
Alexandra Souly,
Chan Jun Shern,
Daniel del Castillo,
Tom Lieberum
Abstract:
We held the first-ever MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) Competition at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). The goal of the competition was to promote research towards agents that use learning from human feedback (LfHF) techniques to solve open-world tasks. Rather than mandating the use of LfHF techniques, we described four tasks in natural language to be accomplished in the video game Minecraft, and allowed participants to use any approach they wanted to build agents that could accomplish the tasks. Teams developed a diverse range of LfHF algorithms across a variety of possible human feedback types. The three winning teams implemented significantly different approaches while achieving similar performance. Interestingly, their approaches performed well on different tasks, validating our choice of tasks to include in the competition. While the outcomes validated the design of our competition, we did not get as many participants and submissions as our sister competition, MineRL Diamond. We speculate about the causes of this problem and suggest improvements for future iterations of the competition.
Submitted 14 April, 2022;
originally announced April 2022.