Showing 1–7 of 7 results for author: Nadeau, M

Searching in archive cs.

  1. arXiv:2309.05973  [pdf, other]

    cs.CL cs.LG

    Circuit Breaking: Removing Model Behaviors with Targeted Ablation

    Authors: Maximilian Li, Xander Davies, Max Nadeau

    Abstract: Language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. We propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. Given a small dataset of inputs where t…

    Submitted 29 January, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

    Journal ref: Workshop on Challenges in Deployable Generative AI at International Conference on Machine Learning (ICML), Honolulu, Hawaii, USA. 2023

  2. arXiv:2308.15605  [pdf, other]

    cs.LG

    Benchmarks for Detecting Measurement Tampering

    Authors: Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

    Abstract: When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measuremen…

    Submitted 29 September, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

    Comments: Edits: extended and improved appendices, fixed references, figures, and typos

  3. arXiv:2307.15217  [pdf, other]

    cs.AI cs.CL cs.LG

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Authors: Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, et al. (7 additional authors not shown)

    Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and rel…

    Submitted 11 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

  4. arXiv:2307.03637  [pdf, other]

    cs.AI

    Discovering Variable Binding Circuitry with Desiderata

    Authors: Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

    Abstract: Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach which extends causal mediation experiments to automatically identify model components responsible for performing a specific subtask by solely specifying a set of deside…

    Submitted 7 July, 2023; originally announced July 2023.

  5. arXiv:2205.15201  [pdf, ps, other]

    cs.RO

    Controller design and experimental evaluation of a motorised assistance for a patient transfer floor lift

    Authors: Donatien Callon, Ian Lalonde, Mathieu Nadeau, Alexandre Girard

    Abstract: Patient transfer is a challenging, critical task because it exposes caregivers to injury risks. Available transfer devices, like floor lifts, lead to improvements but are far from perfect. They do not eliminate the caregivers' risk of musculoskeletal disorders, and they can be burdensome to use due to their poor maneuverability. This paper presents a new motorized floor lift with a single central m…

    Submitted 23 May, 2024; v1 submitted 30 May, 2022; originally announced May 2022.

  6. arXiv:2110.03605  [pdf, other]

    cs.LG cs.AI cs.CV

    Robust Feature-Level Adversaries are Interpretability Tools

    Authors: Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

    Abstract: The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we…

    Submitted 11 September, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2022, code available at https://github.com/thestephencasper/feature_level_adv

  7. arXiv:2004.13709  [pdf, other]

    cs.CR eess.SP eess.SY

    A Low-Power Dual-Factor Authentication Unit for Secure Implantable Devices

    Authors: Saurav Maji, Utsav Banerjee, Samuel H Fuller, Mohamed R Abdelhamid, Phillip M Nadeau, Rabia Tugce Yazicigil, Anantha P Chandrakasan

    Abstract: This paper presents a dual-factor authentication protocol and its low-power implementation for security of implantable medical devices (IMDs). The protocol incorporates traditional cryptographic first-factor authentication using Datagram Transport Layer Security - Pre-Shared Key (DTLS-PSK) followed by the user's touch-based voluntary second-factor authentication for enhanced security. With a low-p…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: Published in 2020 IEEE Custom Integrated Circuits Conference (CICC)