Showing 1–19 of 19 results for author: Krakovna, V

  1. arXiv:2404.16244  [pdf, other]

    cs.CY

    The Ethics of Advanced AI Assistants

    Authors: Iason Gabriel, Arianna Manzini, Geoff Keeling, Lisa Anne Hendricks, Verena Rieser, Hasan Iqbal, Nenad Tomašev, Ira Ktena, Zachary Kenton, Mikel Rodriguez, Seliem El-Sayed, Sasha Brown, Canfer Akbulut, Andrew Trask, Edward Hughes, A. Stevie Bergman, Renee Shelby, Nahema Marchal, Conor Griffin, Juan Mateos-Garcia, Laura Weidinger, Winnie Street, Benjamin Lange, Alex Ingerman, Alison Lentz, et al. (32 additional authors not shown)

    Abstract: This paper focuses on the opportunities and the ethical and societal risks posed by advanced AI assistants. We define advanced AI assistants as artificial agents with natural language interfaces, whose function is to plan and execute sequences of actions on behalf of a user, across one or more domains, in line with the user's expectations. The paper starts by considering the technology itself, pro…

    Submitted 28 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  2. arXiv:2403.13793  [pdf, other]

    cs.LG

    Evaluating Frontier Models for Dangerous Capabilities

    Authors: Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, et al. (2 additional authors not shown)

    Abstract: To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous…

    Submitted 5 April, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  3. arXiv:2402.05829  [pdf, other]

    cs.AI

    Limitations of Agents Simulated by Predictive Models

    Authors: Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna

    Abstract: There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidde…

    Submitted 8 February, 2024; originally announced February 2024.

  4. arXiv:2401.03529  [pdf, other]

    cs.AI

    Quantifying stability of non-power-seeking in artificial agents

    Authors: Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna

    Abstract: We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 37 pages, 5 figures

  5. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  6. arXiv:2304.06528  [pdf, other]

    cs.AI

    Power-seeking can be probable and predictive for trained agents

    Authors: Victoria Krakovna, Janos Kramar

    Abstract: Power-seeking behavior is a key source of risk from advanced AI, but our theoretical understanding of this phenomenon is relatively limited. Building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simpl…

    Submitted 13 April, 2023; originally announced April 2023.

  7. arXiv:2210.01790  [pdf, other]

    cs.LG

    Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

    Authors: Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton

    Abstract: The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that the designers did not foresee. However, an AI system may pursue an undesired goal even when the specification is correct, in the case of goal misgener…

    Submitted 2 November, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

  8. arXiv:2011.08827  [pdf, other]

    cs.LG cs.AI

    Avoiding Tampering Incentives in Deep RL via Decoupled Approval

    Authors: Jonathan Uesato, Ramana Kumar, Victoria Krakovna, Tom Everitt, Richard Ngo, Shane Legg

    Abstract: How can we design agents that pursue a given objective when all feedback mechanisms are influenceable by the agent? Standard RL algorithms assume a secure reward function, and can thus perform poorly in settings where agents can tamper with the reward-generating mechanism. We present a principled solution to the problem of learning from influenceable feedback, which combines approval with a decoup…

    Submitted 17 November, 2020; originally announced November 2020.

  9. arXiv:2011.08820  [pdf, other]

    cs.LG cs.AI

    REALab: An Embedded Perspective on Tampering

    Authors: Ramana Kumar, Jonathan Uesato, Richard Ngo, Tom Everitt, Victoria Krakovna, Shane Legg

    Abstract: This paper describes REALab, a platform for embedded agency research in reinforcement learning (RL). REALab is designed to model the structure of tampering problems that may arise in real-world deployments of RL. Standard Markov Decision Process (MDP) formulations of RL and simulated environments mirroring the MDP structure assume secure access to feedback (e.g., rewards). This may be unrealistic…

    Submitted 17 November, 2020; originally announced November 2020.

  10. arXiv:2010.07877  [pdf, other]

    cs.LG cs.AI

    Avoiding Side Effects By Considering Future Tasks

    Authors: Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, Shane Legg

    Abstract: Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the…

    Submitted 15 October, 2020; originally announced October 2020.

    Comments: Published in NeurIPS 2020

  11. arXiv:1908.04734  [pdf, ps, other]

    cs.AI cs.LG

    Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective

    Authors: Tom Everitt, Marcus Hutter, Ramana Kumar, Victoria Krakovna

    Abstract: Can humans get arbitrarily capable reinforcement learning (RL) agents to do their bidding? Or will sufficiently capable RL agents always find ways to bypass their intended objectives by shortcutting their reward signal? This question impacts how far RL can be scaled, and whether alternative paradigms must be developed in order to build safe artificial general intelligence. In this paper, we study…

    Submitted 26 March, 2021; v1 submitted 13 August, 2019; originally announced August 2019.

    Comments: Accepted to Synthese, March 2021

  12. arXiv:1906.08663  [pdf, other]

    cs.AI

    Modeling AGI Safety Frameworks with Causal Influence Diagrams

    Authors: Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg

    Abstract: Proposals for safe AGI systems are typically made at the level of frameworks, specifying how the components of the proposed system should be trained and interact with each other. In this paper, we model and compare the most promising AGI safety frameworks using causal influence diagrams. The diagrams show the optimization objective and causal assumptions of the framework. The unified representatio…

    Submitted 20 June, 2019; originally announced June 2019.

    Comments: IJCAI 2019 AI Safety Workshop

  13. arXiv:1806.01186  [pdf, other]

    cs.LG cs.AI stat.ML

    Penalizing side effects using stepwise relative reachability

    Authors: Victoria Krakovna, Laurent Orseau, Ramana Kumar, Miljan Martic, Shane Legg

    Abstract: How can we design safe reinforcement learning agents that avoid unnecessary disruptions to their environment? We show that current approaches to penalizing side effects can introduce bad incentives, e.g. to prevent any irreversible changes in the environment, including the actions of other agents. To isolate the source of such undesirable incentives, we break down side effects penalties into two c…

    Submitted 8 March, 2019; v1 submitted 4 June, 2018; originally announced June 2018.

  14. arXiv:1711.09883  [pdf, other]

    cs.LG cs.AI

    AI Safety Gridworlds

    Authors: Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg

    Abstract: We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environ…

    Submitted 28 November, 2017; v1 submitted 27 November, 2017; originally announced November 2017.

  15. arXiv:1705.08417  [pdf, other]

    cs.AI cs.LG stat.ML

    Reinforcement Learning with a Corrupted Reward Channel

    Authors: Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, Shane Legg

    Abstract: No real-world reward function is perfect. Sensory errors and software bugs may result in RL agents observing higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward…

    Submitted 19 August, 2017; v1 submitted 23 May, 2017; originally announced May 2017.

    Comments: A shorter version of this report was accepted to IJCAI 2017 AI and Autonomy track

    ACM Class: I.2.6; I.2.8

  16. arXiv:1611.05934  [pdf, other]

    stat.ML cs.LG

    Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models

    Authors: Viktoriya Krakovna, Finale Doshi-Velez

    Abstract: As deep neural networks continue to revolutionize various application domains, there is increasing interest in making these powerful models more understandable and interpretable, and narrowing down the causes of good and bad predictions. We focus on recurrent neural networks, state of the art models in speech recognition and translation. Our approach to increasing interpretability is by combining…

    Submitted 17 November, 2016; originally announced November 2016.

    Comments: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems. arXiv admin note: substantial text overlap with arXiv:1606.05320

  17. arXiv:1606.05320  [pdf, other]

    stat.ML cs.CL cs.LG

    Increasing the Interpretability of Recurrent Neural Networks Using Hidden Markov Models

    Authors: Viktoriya Krakovna, Finale Doshi-Velez

    Abstract: As deep neural networks continue to revolutionize various application domains, there is increasing interest in making these powerful models more understandable and interpretable, and narrowing down the causes of good and bad predictions. We focus on recurrent neural networks (RNNs), state of the art models in speech recognition and translation. Our approach to increasing interpretability is by com…

    Submitted 30 September, 2016; v1 submitted 16 June, 2016; originally announced June 2016.

    Comments: Presented at 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY

  18. arXiv:1602.04259  [pdf, other]

    cs.AI cs.LG stat.ML

    A Minimalistic Approach to Sum-Product Network Learning for Real Applications

    Authors: Viktoriya Krakovna, Moshe Looks

    Abstract: Sum-Product Networks (SPNs) are a class of expressive yet tractable hierarchical graphical models. LearnSPN is a structure learning algorithm for SPNs that uses hierarchical co-clustering to simultaneously identify similar entities and similar features. The original LearnSPN algorithm assumes that all the variables are discrete and there is no missing data. We introduce a practical, simplified…

    Submitted 24 April, 2016; v1 submitted 12 February, 2016; originally announced February 2016.

    Comments: Accepted to ICLR 2016 workshop track

  19. arXiv:1506.02371  [pdf, other]

    stat.ML

    Interpretable Selection and Visualization of Features and Interactions Using Bayesian Forests

    Authors: Viktoriya Krakovna, Jiong Du, Jun S. Liu

    Abstract: It is becoming increasingly important for machine learning methods to make predictions that are interpretable as well as accurate. In many practical applications, it is of interest which features and feature interactions are relevant to the prediction task. We present a novel method, Selective Bayesian Forest Classifier, that strikes a balance between predictive power and interpretability by simul…

    Submitted 7 February, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

    Comments: R package: github.com/vkrakovna/sbfc