Search | arXiv e-print repository

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.

Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

arXiv:2202.12412 [pdf, other]

Fourier-Based Augmentations for Improved Robustness and Uncertainty Calibration

Authors: Ryan Soklaski, Michael Yee, Theodoros Tsiligkaridis

Abstract: Diverse data augmentation strategies are a natural approach to improving robustness in computer vision models against unforeseen shifts in data distribution. However, the ability to tailor such strategies to inoculate a model against specific classes of corruptions or attacks -- without incurring substantial losses in robustness against other classes of corruptions -- remains elusive. In this work, we successfully harden a model against Fourier-based attacks, while producing superior-to-AugMix accuracy and calibration results on both the CIFAR-10-C and CIFAR-100-C datasets; classification error is reduced by over ten percentage points for some high-severity noise and digital-type corruptions. We achieve this by incorporating Fourier-basis perturbations in the AugMix image-augmentation framework. Thus we demonstrate that the AugMix framework can be tailored to effectively target particular distribution shifts, while boosting overall model robustness.

Submitted 24 February, 2022; originally announced February 2022.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Sydney, Australia

arXiv:2202.03188 [pdf]

Knowledge-Integrated Informed AI for National Security

Authors: Anu K. Myne, Kevin J. Leahy, Ryan J. Soklaski

Abstract: The state of artificial intelligence technology has a rich history that dates back decades and includes two fall-outs before the explosive resurgence of today, which is credited largely to data-driven techniques. While AI technology has and continues to become increasingly mainstream with impact across domains and industries, it's not without several drawbacks, weaknesses, and potential to cause undesired effects. AI techniques are numerous with many approaches and variants, but they can be classified simply based on the degree of knowledge they capture and how much data they require; two broad categories emerge as prominent across AI to date: (1) techniques that are primarily, and often solely, data-driven while leveraging little to no knowledge and (2) techniques that primarily leverage knowledge and depend less on data. Now, a third category is starting to emerge that leverages both data and knowledge, that some refer to as "informed AI." This third category can be a game changer within the national security domain where there is ample scientific and domain-specific knowledge that stands ready to be leveraged, and where purely data-driven AI can lead to serious unwanted consequences. This report shares findings from a thorough exploration of AI approaches that exploit data as well as principled and/or practical knowledge, which we refer to as "knowledge-integrated informed AI." Specifically, we review illuminating examples of knowledge integrated in deep learning and reinforcement learning pipelines, taking note of the performance gains they provide. We also discuss an apparent trade space across variants of knowledge-integrated informed AI, along with observed and prominent issues that suggest worthwhile future research directions. Most importantly, this report suggests how the advantages of knowledge-integrated informed AI stand to benefit the national security domain.

Submitted 4 February, 2022; originally announced February 2022.

Report number: Technical Report TR-1272

arXiv:2201.05647 [pdf, other]

Tools and Practices for Responsible AI Engineering

Authors: Ryan Soklaski, Justin Goodwin, Olivia Brown, Michael Yee, Jason Matterer

Abstract: Responsible Artificial Intelligence (AI) - the practice of developing, evaluating, and maintaining accurate AI systems that also exhibit essential properties such as robustness and explainability - represents a multifaceted challenge that often stretches standard machine learning tooling, frameworks, and testing methods beyond their limits. In this paper, we present two new software libraries - hydra-zen and the rAI-toolbox - that address critical needs for responsible AI engineering. hydra-zen dramatically simplifies the process of making complex AI applications configurable, and their behaviors reproducible. The rAI-toolbox is designed to enable methods for evaluating and enhancing the robustness of AI-models in a way that is scalable and that composes naturally with other popular ML frameworks. We describe the design principles and methodologies that make these tools effective, including the use of property-based testing to bolster the reliability of the tools themselves. Finally, we demonstrate the composability and flexibility of the tools by showing how various use cases from adversarial robustness and explainable AI can be concisely implemented with familiar APIs.

Submitted 14 January, 2022; originally announced January 2022.

arXiv:1803.05268 [pdf, other]

doi 10.1109/CVPR.2018.00519

Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Authors: David Mascharka, Philip Tran, Ryan Soklaski, Arjun Majumdar

Abstract: Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.

Submitted 2 July, 2018; v1 submitted 14 March, 2018; originally announced March 2018.

Comments: CVPR 2018 pre-print

arXiv:1508.02736 [pdf, other]

doi 10.1103/PhysRevB.93.214201

A Dramatically Growing Shear Rigidity Length Scale in a Supercooled Glass Former ($NiZr_2$)

Authors: Nicholas B. Weingartner, Ryan Soklaski, K. F. Kelton, Zohar Nussinov

Abstract: Finding a suitably growing length scale that increases in tandem with the immense viscous slowdown of supercooled liquids is an open problem associated with the glass transition. Here, we define and demonstrate the existence of one such length scale which may be experimentally verifiable. This is the length scale over which external shear perturbations appreciably penetrate into a liquid as the glass transition is approached. We provide simulation based evidence of its existence, and its growth by at least an order of magnitude, by using molecular dynamics simulations of NiZr2, a good fragile glass former. On the probed timescale, upon approaching the glass transition temperature from above, this length scale, ξ is also shown to be consistent with Ising-like scaling. Furthermore, we demonstrate the possible scaling of ξ about the temperature at which super-Arrhenius growth of viscosity, and a marked growth of the penetration depth sets in. Our simulation results suggest that upon supercooling, marked initial increase of the shear penetration depth in fluids may occur in tandem with the breakdown of the Stokes-Einstein relation.

Submitted 6 April, 2016; v1 submitted 11 August, 2015; originally announced August 2015.

Comments: 10 pages, 12 figures; Renamed, Massive Revisions, PRB Accepted

Journal ref: Phys. Rev. B 93, 214201 (2016)

arXiv:1502.01739 [pdf, ps, other]

doi 10.1080/14786435.2016.1158427

A locally preferred structure characterises all dynamical regimes of a supercooled liquid

Authors: Ryan Soklaski, Vy Tran, Zohar Nussinov, K. F. Kelton, Li Yang

Abstract: Recent experimental results suggest that metallic liquids universally exhibit a high-temperature dynamical crossover, which is correlated with the glass transition temperature ($T_{g}$). We demonstrate, using molecular dynamics results for Cu64Zr36, that this temperature, $T_{A} \approx 2 \times T_{g}$, is linked with cooperative atomic rearrangements that produce domains of connected icosahedra. Supercooling to a new characteristic temperature, $T_{D}$, is shown to produce higher order cooperative rearrangements amongst connected icosahedra, leading to large-scale domain fluctuations and the onset of glassy dynamics. These extensive domains then abruptly stabilize above $T_{g}$ and eventually percolate before the glass is formed. All characteristic temperatures ($T_{A}$, $T_{D}$ and $T_{g}$) are thus connected by successive manifestations of the structural cooperativity that begins at $T_{A}$.

Submitted 23 March, 2016; v1 submitted 5 February, 2015; originally announced February 2015.

Comments: 21 pages with 9 figures, Philosophical Magazine, 2016

arXiv:1405.2836 [pdf]

doi 10.1021/nl502865s

Enhanced Thermoelectric Efficiency via Orthogonal Electrical and Thermal Conductances in Phosphorene

Authors: Ruixiang Fei, Alireza Faghaninia, Ryan Soklaski, Jia-An Yan, Cynthia Lo, Li Yang

Abstract: Thermoelectric devices that utilize the Seebeck effect convert heat flow into electrical energy and are highly desirable for the development of portable, solid state, passively-powered electronic systems. The conversion efficiencies of such devices are quantified by the dimensionless thermoelectric figure of merit (ZT), which is proportional to the ratio of a device's electrical conductance to its thermal conductance. High ZT (>2) has been achieved in materials via all-scale hierarchical architecturing. This efficiency holds at high temperatures (700K~900K) but quickly diminishes at lower temperatures. In this paper, a recently-fabricated two-dimensional (2D) semiconductor called phosphorene (monolayer black phosphorus) is assessed for its thermoelectric capabilities. First-principles and model calculations reveal that phosphorene possesses spatially-anisotropic electrical and thermal conductances. The prominent electrical and thermal conducting directions are orthogonal to one another, enhancing the ratio of these conductances. As a result, ZT can reach 2.5 (the criterion for commercial deployment) along the armchair direction of phosphorene at T=500K and is greater than 1 even at room temperature given moderate doping (~2 x 10^16 m^-2). Ultimately, phosphorene stands out as an environmentally sound thermoelectric material with unprecedented qualities: intrinsically, it is a mechanically flexible material that converts heat energy with high efficiency at low temperatures (~ 300K) - one whose performance does not require any sophisticated engineering techniques.

Submitted 12 May, 2014; originally announced May 2014.

Comments: 22 pages with 6 figures

Journal ref: Nano Lett, 14, 6393 (2014)

arXiv:1402.4192 [pdf, ps, other]

doi 10.1103/PhysRevB.89.235319

Tunable Band Gap and Anisotropic Optical Response in Few-layer Black Phosphorus

Authors: Vy Tran, Ryan Soklaski, Yufeng Liang, Li Yang

Abstract: We report the quasiparticle band gap, excitons, and highly anisotropic optical responses of few-layer black phosphorous (phosphorene). It is shown that these new materials exhibit unique many-electron effects; the electronic structures are dispersive essentially along one dimension, leading to particularly enhanced self-energy corrections and excitonic effects. Additionally, within a wide energy range, including infrared light and part of visible light, few-layer black phosphorous absorbs light polarized along the structure's armchair direction and is transparent to light polarized along the zigzag direction, making them viable linear polarizers for applications. Finally, the number of phosphorene layers included in the stack controls the material's band gap, optical absorption spectrum, and anisotropic polarization energy-window across a wide range.

Submitted 15 April, 2014; v1 submitted 17 February, 2014; originally announced February 2014.

Comments: 12 pages with 5 figures and 1 table

Journal ref: Phys. Rev. B 89, 235319 (2014)

arXiv:1401.6663 [pdf]

doi 10.1103/PhysRevB.90.115418

New Mechanism for Strongly Bound Excitons in Gapless Two-Dimensional Structures

Authors: Yufeng Liang, Ryan Soklaski, Shouting Huang, Matthew W. Graham, Robin Havener, Jiwoong Park, Li Yang

Abstract: Common wisdom asserts that bound excitons cannot form in high-dimensional (d>1) metallic structures because of their overwhelming screening and unavoidable resonance with nearby continuous bands. Strikingly, here we illustrate that this prevalent assumption is not quite true. A key ingredient that has been overlooked is that of viable decoherence that thwarts the formation of resonances. As an example of this general mechanism, we focus on an experimentally relevant material and predict bound excitons in twisted bilayer graphene, which is a two-dimensional gapless structure exhibiting metallic screening. The binding energies calculated by first-principles simulations are surprisingly large. The low-energy effective model reveals that these bound states are produced by a unique destructive coherence between two alike subband resonant excitons. In particular, this destructive coherent effect is not sensitive to the screening and dimensionality, and hence may persist as a general mechanism for creating bound excitons in various metallic structures, opening the door for excitonic applications based on metallic structures.

Submitted 26 January, 2014; originally announced January 2014.

Comments: 12 pages and 5 figures

Journal ref: Phys. Rev. B 90, 115418 (2014)

arXiv:1401.5732 [pdf, ps, other]

doi 10.1063/1.4878098

Temperature Renormalization of Optical Spectra of Monolayer MoS2

Authors: Ryan Soklaski, Yufeng Liang, Changjian Zhang, Haining Wang, Farhan Rana, Li Yang

Abstract: Newly measured optical absorption and photoluminescence spectra reveal substantial frequency shifts of both exciton and trion peaks as monolayer MoS2 is cooled from 363 K to 4 K. First-principles simulations using the GW-Bethe-Salpeter Equation approach satisfactorily reproduce these frequency shifts by incorporating many-electron interactions and the thermal expansion of the in-plane lattice constant. Studying these temperature effects in monolayer MoS2 is crucial for rectifying the results of room-temperature experiments with the previous predictions of zero-temperature-limit simulations. Moreover, we estimate that the thermal expansion coefficient of monolayer MoS2 is around 25% less than that of bulk counterpart by tracking the frequency shifts of the exciton or trion peaks in optical spectra. This may serve as a convenient way to estimate thermal expansion coefficients of general two-dimensional chalcogenides.

Submitted 22 January, 2014; originally announced January 2014.

Comments: 12 pages and 4 figures

Journal ref: Appl. Phys. Lett. 104, 193110 (2014)

arXiv:1306.0620 [pdf, ps, other]

doi 10.1063/1.4816517

Quasiparticle band-edge energy and band offsets of monolayer of molybdenum and tungsten chalcogenide

Authors: Yufeng Liang, Shouting Huang, Ryan Soklaski, Li Yang

Abstract: We report the quasiparticle energy of monolayer of molybdenum and tungsten dichalcogenides, MX2 (M=Mo, W; X=S, Se, Te). Beyond calculating bandgaps, we have achieved converged absolute band energies relative to the vacuum level. Compared with the results from other approaches, the GW calculation reveals substantially larger bandgaps and different absolute band energies because of enhanced many-electron effects. Interestingly, our fully-converged quasiparticle energies ratify the band-gap-center approximation, making it a convenient way to estimate quasiparticle energy. The absolute quasiparticle energies and band offsets obtained in this work are important for designing heterojunction devices and chemical catalysts based on monolayer dichalcogenides.

Submitted 26 July, 2013; v1 submitted 3 June, 2013; originally announced June 2013.

Journal ref: Appl. Phys. Lett. 103, 042106 (2013)

arXiv:1302.1895 [pdf, ps, other]

doi 10.1103/PhysRevB.87.184203

Connectivity of the Icosahedral Network and a Dramatically Growing Static Length Scale in Cu-Zr Binary Metallic Glasses

Authors: Ryan Soklaski, Zohar Nussinov, Zachary Markow, K. F. Kelton, Li Yang

Abstract: We report on and characterize, via molecular dynamics (MD) studies, the evolution of the structure of Cu50Zr50 and Cu64Zr36 metallic glasses (MGs) as temperature is varied. Interestingly, a percolating icosahedral network appears in the Cu64Zr36 system as it is supercooled. This leads us to introduce a static length scale, which grows dramatically as this three dimensional system approaches the glass transition. Amidst interpenetrating connections, non-interpenetrating connections between icosahedra are shown to become prevalent upon supercooling and to greatly enhance the connectivity of the MG's icosahedral network. Additionally, we characterize the chemical compositions of the icosahedral networks and their components. These findings demonstrate the importance of non-interpenetrating connections for facilitating extensive structural networks in Cu-Zr MGs, which in turn drive dynamical slowing in these materials.

Submitted 7 February, 2013; originally announced February 2013.

Comments: 9 pages and 8 figures

Journal ref: Phys. Rev. B 87, 184203 (2013)

Showing 1–13 of 13 results for author: Soklaski, R