
Showing 1–1 of 1 result for author: Gunter, E R

  1. arXiv:2401.03529 [pdf, other]

    cs.AI

    Quantifying stability of non-power-seeking in artificial agents

    Authors: Evan Ryan Gunter, Yevgeny Liokumovich, Victoria Krakovna

    Abstract: We investigate the question: if an AI agent is known to be safe in one setting, is it also safe in a new setting similar to the first? This is a core question of AI alignment--we train and test models in a certain environment, but deploy them in another, and we need to guarantee that models that seem safe in testing remain so in deployment. Our notion of safety is based on power-seeking--an agent…

    Submitted 7 January, 2024; originally announced January 2024.

    Comments: 37 pages, 5 figures