Showing 1–5 of 5 results for author: Halawi, D

Searching in archive cs.
  1. arXiv:2406.20053  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

    Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

    Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious d…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 22 pages

  2. arXiv:2405.06846  [pdf, other]

    cs.AI

    Dominion: A New Frontier for AI Research

    Authors: Danny Halawi, Aron Sarmasi, Siena Saltzen, Joshua McCoy

    Abstract: In recent years, machine learning approaches have made dramatic advances, reaching superhuman performance in Go, Atari, and poker variants. These games, and others before them, have served not only as a testbed but have also helped to push the boundaries of AI research. Continuing this tradition, we examine the tabletop game Dominion and discuss the properties that make it well-suited to serve as…

    Submitted 10 May, 2024; originally announced May 2024.

  3. arXiv:2402.18563  [pdf, other]

    cs.LG cs.AI cs.CL cs.IR

    Approaching Human-Level Forecasting with Language Models

    Authors: Danny Halawi, Fred Zhang, Chen Yueh-Han, Jacob Steinhardt

    Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large data…

    Submitted 28 February, 2024; originally announced February 2024.

  4. arXiv:2307.09476  [pdf, other]

    cs.LG cs.AI cs.CL

    Overthinking the Truth: Understanding how Language Models Process False Demonstrations

    Authors: Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt

    Abstract: Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: "overthinking" and "false…

    Submitted 12 March, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

  5. arXiv:2303.08112  [pdf, other]

    cs.LG

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Authors: Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

    Abstract: We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer. To do so, we train an affine probe for each block in a frozen pretrained model, making it possible to decode every hidden state into a distribution over the vocabulary. Our method, the tuned lens, is a refinement of the earlier "logit lens" technique…

    Submitted 26 November, 2023; v1 submitted 14 March, 2023; originally announced March 2023.