
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.

Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: updated to add missing acknowledgements

arXiv:2308.00862 [pdf, ps, other]

Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings

Authors: Sarah Shoker, Andrew Reddie, Sarah Barrington, Ruby Booth, Miles Brundage, Husanjot Chahal, Michael Depp, Bill Drexel, Ritwik Gupta, Marina Favaro, Jake Hecla, Alan Hickey, Margarita Konaev, Kirthi Kumar, Nathan Lambert, Andrew Lohn, Cullen O'Keefe, Nazneen Rajani, Michael Sellitto, Robert Trager, Leah Walker, Alexa Wehsener, Jessica Young

Abstract: Foundation models could eventually introduce several pathways for undermining state security: accidents, inadvertent escalation, unintentional conflict, the proliferation of weapons, and the interference with human diplomacy are just a few on a long list. The Confidence-Building Measures for Artificial Intelligence workshop hosted by the Geopolitics Team at OpenAI and the Berkeley Risk and Security Lab at the University of California brought together a multistakeholder group to think through the tools and strategies to mitigate the potential risks introduced by foundation models to international security. Originating in the Cold War, confidence-building measures (CBMs) are actions that reduce hostility, prevent conflict escalation, and improve trust between parties. The flexibility of CBMs make them a key instrument for navigating the rapid changes in the foundation model landscape. Participants identified the following CBMs that directly apply to foundation models and which are further explained in this conference proceedings: 1. crisis hotlines 2. incident sharing 3. model, transparency, and system cards 4. content provenance and watermarks 5. collaborative red teaming and table-top exercises and 6. dataset and evaluation sharing. Because most foundation model developers are non-government entities, many CBMs will need to involve a wider stakeholder community. These measures can be implemented either by AI labs or by relevant government actors.

Submitted 3 August, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

arXiv:2306.17682 [pdf]

ADS Standardization Landscape: Making Sense of its Status and of the Associated Research Questions

Authors: Scott Schnelle, Francesca M. Favaro

Abstract: Automated Driving Systems (ADS) hold great potential to increase safety, mobility, and equity. However, without public acceptance, none of these promises can be fulfilled. To engender public trust, many entities in the ADS community participate in standards development organizations (SDOs) with the goal of enhancing safety for the entire industry through a collaborative approach. The breadth and depth of the ADS safety standardization landscape is vast and constantly changing, as often is the case for novel technologies in rapid evolution. The pace of development of the ADS industry makes it hard for the public and interested parties to keep track of ongoing SDO efforts, including the topics touched by each standard and the committees addressing each topic, as well as make sense of the wealth of documentation produced. Therefore, the authors present here a simplified framework for abstracting and organizing the current landscape of ADS safety standards into high-level, long term themes. This framework is then utilized to develop and organize associated research questions that have not yet reached widely adopted industry positions, along with identifying potential gaps where further research and standardization is needed.

Submitted 30 June, 2023; originally announced June 2023.

Comments: 13 pages, 1 figure

arXiv:2306.14923 [pdf]

Interpreting Safety Outcomes: Waymo's Performance Evaluation in the Context of a Broader Determination of Safety Readiness

Authors: Francesca M. Favaro, Trent Victor, Henning Hohnhold, Scott Schnelle

Abstract: This paper frames recent publications from Waymo within the broader context of the safety readiness determination for an Automated Driving System (ADS). Starting from a brief overview of safety performance outcomes reported by Waymo (i.e., contact events experienced during fully autonomous operations), this paper highlights the need for a diversified approach to safety determination that complements the analysis of observed safety outcomes with other estimation techniques. Our discussion highlights: the presentation of a "credibility paradox" within the comparison between ADS crash data and human-derived baselines; the recognition of continuous confidence growth through in-use monitoring; and the need to supplement any aggregate statistical analysis with appropriate event-level reasoning.

Submitted 23 June, 2023; originally announced June 2023.

arXiv:2104.08090 [pdf]

doi 10.1088/1361-6463/abe5e2

Near ambient pressure photoelectron spectro-microscopy: from gas-solid interface to operando devices

Authors: Matteo Amati, Luca Gregoratti, Patrick Zeller, Mark Greiner, Mattia Scardamaglia, Benjamin Junker, Tamara Ruß, Udo Weimar, Nicolae Barsan, Marco Favaro, Abdulaziz Alharbi, Ingvild J. T. Jensen, Ayaz Ali, Branson D. Belle

Abstract: Near Ambient Pressure Scanning Photoelectron Microscopy adds to the widely used photoemission spectroscopy and its chemically selective capability two key features: (i) the possibility to chemically analyse samples in a more realistic environmental, gas pressure condition, and (ii) the capability to investigate a system at the relevant spatial scale. To achieve these goals the approach developed at the ESCA Microscopy beamline at the Elettra Synchrotron facility combines the submicron lateral resolution of a Scanning Photoelectron Microscope with a custom designed Near Ambient Pressure Cell where a gas pressure up to 0.1 mbar is confined inside it around the sample. In this manuscript a review of experiments performed with this unique setup will be presented to illustrate its potentiality in both fundamental and applicative research such as the oxidation reactivity and gas sensitivity of metal oxides and semiconductors. In particular the capability to do operando experiment with this setup opens the possibility to perform investigations with active devices to properly address the real nature of the studied systems, because it can yield to more conclusive results when microscopy and spectroscopy are simultaneously combined in a single technique.

Submitted 16 April, 2021; originally announced April 2021.

Journal ref: Journal of Physics D: Applied Physics (2021)

arXiv:2012.10779 [pdf]

doi 10.1063/5.0025326

Soft X-ray spectroscopies in liquids and at solid-liquid interface at BACH beamline at Elettra

Authors: Silvia Nappini, Luca D'Amario, Marco Favaro, Simone Dal Zilio, Federico Salvador, Erik Betz-Guttner, Andrea Fondacaro, Igor Pis, Luca Romanzin, Alessandro Gambitta, Federica Bondino, Marco Lazzarino, Elena Magnano

Abstract: The Beamline for Advanced diCHroism (BACH) of the Istituto Officina dei Materiali-Consiglio Nazionale delle Ricerche (IOM-CNR), operating at Elettra synchrotron in Trieste (Italy), works in the extreme ultra violet (EUV)-soft X-ray photon energy range with selectable light polarization, high energy resolution, brilliance and time resolution. The beamline offers a multi-technique approach for the investigation of the electronic, chemical, structural, magnetic, and dynamical properties of materials. Recently one of the three end-stations has been dedicated to experiments based on electron transfer processes at the solid/liquid interfaces and during photocatalytic or electrochemical reactions. Suitable cells to perform soft X-ray spectroscopy in the presence of liquids and reagent gases at ambient pressure were developed. Here we present two types of static cells working in transmission or in fluorescence yield, and an electrochemical flow cell which allows to carry out cyclic voltammetry in situ, electrodeposition on a working electrode (WE) and to study chemical reactions in-operando conditions. Examples of X-ray absorption spectroscopy (XAS) measurements performed in ambient conditions and during electrochemical experiments in liquid are presented.

Submitted 19 December, 2020; originally announced December 2020.

Comments: 30 pages, 14 figures, accepted by Review of Scientific Instruments

arXiv:1909.01752 [pdf, other]

SATURN -- Software Deobfuscation Framework Based on LLVM

Authors: Peter Garba, Matteo Favaro

Abstract: The strength of obfuscated software has increased over the recent years. Compiler based obfuscation has become the de facto standard in the industry and recent papers also show that injection of obfuscation techniques is done at the compiler level. In this paper we discuss a generic approach for deobfuscation and recompilation of obfuscated code based on the compiler framework LLVM. We show how binary code can be lifted back into the compiler intermediate language LLVM-IR and explain how we recover the control flow graph of an obfuscated binary function with an iterative control flow graph construction algorithm based on compiler optimizations and SMT solving. Our approach does not make any assumptions about the obfuscated code, but instead uses strong compiler optimizations available in LLVM and Souper Optimizer to simplify away the obfuscation. Our experimental results show that this approach can be effective to weaken or even remove the applied obfuscation techniques like constant unfolding, certain arithmetic-based opaque expressions, dead code insertions, bogus control flow or integer encoding found in public and commercial obfuscators. The recovered LLVM-IR can be further processed by custom deobfuscation passes that are now applied at the same level as the injected obfuscation techniques or recompiled with one of the available LLVM backends. The presented work is implemented in a deobfuscation tool called SATURN.

Submitted 5 September, 2019; v1 submitted 4 September, 2019; originally announced September 2019.

Comments: reverse engineering, llvm, code lifting, obfuscation, deobfuscation, static software analysis, binary recompilation, binary rewriting

Journal ref: 3rd International Workshop on Software PROtection, Nov 2019, London, United Kingdom

arXiv:1612.05621 [pdf, other]

doi 10.1007/JHEP05(2017)008

Intrinsic limits on resolutions in muon- and electron-neutrino charged-current events in the KM3NeT/ORCA detector

Authors: S. Adrián-Martínez, M. Ageron, S. Aiello, A. Albert, F. Ameli, E. G. Anassontzis, M. Andre, G. Androulakis, M. Anghinolfi, G. Anton, M. Ardid, T. Avgitas, G. Barbarino, E. Barbarito, B. Baret, J. Barrios-Martí, A. Belias, E. Berbee, A. van den Berg, V. Bertin, S. Beurthey, V. van Beveren, N. Beverini, S. Biagi, A. Biagioni, et al. (228 additional authors not shown)

Abstract: Studying atmospheric neutrino oscillations in the few-GeV range with a multimegaton detector promises to determine the neutrino mass hierarchy. This is the main science goal pursued by the future KM3NeT/ORCA water Cherenkov detector in the Mediterranean Sea. In this paper, the processes that limit the obtainable resolution in both energy and direction in charged-current neutrino events in the ORCA detector are investigated. These processes include the composition of the hadronic fragmentation products, the subsequent particle propagation and the photon-sampling fraction of the detector. GEANT simulations of neutrino interactions in seawater produced by GENIE are used to study the effects in the 1 - 20 GeV range. It is found that fluctuations in the hadronic cascade in conjunction with the variation of the inelasticity y are most detrimental to the resolutions. The effect of limited photon sampling in the detector is of significantly less importance. These results will therefore also be applicable to similar detectors/media, such as those in ice.

Submitted 19 May, 2017; v1 submitted 29 November, 2016; originally announced December 2016.

Comments: 37 pages, 28 figures, JHEP published version

Journal ref: The KM3NeT collaboration, Adrián-Martínez, S., Ageron, M. et al. J. High Energ. Phys. (2017) 2017: 8

Showing 1–8 of 8 results for author: Favaro, M