Showing 1–12 of 12 results for author: Mindermann, S

  1. arXiv:2401.05566  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa…

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  2. arXiv:2310.17688  [pdf, other]

    cs.CY cs.AI cs.CL cs.LG

    Managing extreme AI risks amid rapid progress

    Authors: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

    Abstract: Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI's impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although rese…

    Submitted 22 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: Published in Science: https://www.science.org/doi/10.1126/science.adn0117

  3. arXiv:2310.13798  [pdf, other]

    cs.CL cs.AI

    Specific versus General Principles for Constitutional AI

    Authors: Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson, et al. (11 additional authors not shown)

    Abstract: Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expressi…

    Submitted 20 October, 2023; originally announced October 2023.

  4. arXiv:2309.15840  [pdf, other]

    cs.CL cs.AI cs.LG

    How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

    Authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

    Abstract: Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a…

    Submitted 26 September, 2023; originally announced September 2023.

  5. arXiv:2209.00626  [pdf, ps, other]

    cs.AI cs.LG

    The Alignment Problem from a Deep Learning Perspective

    Authors: Richard Ngo, Lawrence Chan, Sören Mindermann

    Abstract: In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned inte…

    Submitted 19 March, 2024; v1 submitted 29 August, 2022; originally announced September 2022.

    Comments: Published in ICLR 2024

  6. arXiv:2206.07137  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

    Authors: Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

    Abstract: Training on web-scale data can take months. But most computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mi…

    Submitted 26 September, 2022; v1 submitted 14 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  7. arXiv:2107.02565  [pdf, other]

    cs.LG cs.IT

    Prioritized training on points that are learnable, worth learning, and not yet learned (workshop version)

    Authors: Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Jan Brauner, Yarin Gal

    Abstract: We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard"…

    Submitted 17 October, 2023; v1 submitted 6 July, 2021; originally announced July 2021.

    Journal ref: ICML 2021 Workshop on Subset Selection in Machine Learning

  8. arXiv:2103.04850  [pdf, other]

    cs.LG stat.ML

    Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

    Authors: Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit

    Abstract: We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an individual's response to treatment by inducing bias in CATE estimates. We present a new parametric interval estimator suited for high-dimensional data, that estimat…

    Submitted 1 February, 2022; v1 submitted 8 March, 2021; originally announced March 2021.

    Comments: 19 pages, 5 figures, ICML 2021

    Journal ref: PMLR 139 (2021) 4829-4838

  9. arXiv:2007.13454  [pdf, other]

    stat.AP cs.LG q-bio.PE q-bio.QM stat.ML

    How Robust are the Estimated Effects of Nonpharmaceutical Interventions against COVID-19?

    Authors: Mrinank Sharma, Sören Mindermann, Jan Markus Brauner, Gavin Leech, Anna B. Stephenson, Tomáš Gavenčiak, Jan Kulveit, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal

    Abstract: To what extent are effectiveness estimates of nonpharmaceutical interventions (NPIs) against COVID-19 influenced by the assumptions our models make? To answer this question, we investigate 2 state-of-the-art NPI effectiveness models and propose 6 variants that make different structural assumptions. In particular, we investigate how well NPI effectiveness estimates generalise to unseen countries, a…

    Submitted 20 December, 2020; v1 submitted 27 July, 2020; originally announced July 2020.

    Journal ref: NeurIPS 2020, Advances in Neural Information Processing Systems 33

  10. arXiv:2007.00163  [pdf, other]

    cs.LG stat.ML

    Identifying Causal-Effect Inference Failure with Uncertainty-Aware Models

    Authors: Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal

    Abstract: Recommending the best course of action for an individual is a major application of individual-level causal effect estimation. This application is often needed in safety-critical domains such as healthcare, where estimating and communicating uncertainty to decision-makers is crucial. We introduce a practical approach for integrating uncertainty estimation into a class of state-of-the-art neural net…

    Submitted 22 October, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

  11. arXiv:1809.03060  [pdf, other]

    cs.LG cs.AI stat.ML

    Active Inverse Reward Design

    Authors: Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

    Abstract: Designers of AI agents often iterate on the reward function in a trial-and-error process until they get the desired behavior, but this only guarantees good behavior in the training environment. We propose structuring this process as a series of queries asking the user to compare between different reward functions. Thus we can actively select queries for maximum informativeness about the true rewar…

    Submitted 6 November, 2019; v1 submitted 9 September, 2018; originally announced September 2018.

  12. arXiv:1712.05812  [pdf, ps, other]

    cs.AI

    Occam's razor is insufficient to infer the preferences of irrational agents

    Authors: Stuart Armstrong, Sören Mindermann

    Abstract: Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known…

    Submitted 11 January, 2019; v1 submitted 15 December, 2017; originally announced December 2017.