
Showing 1–6 of 6 results for author: Wang, T T

Searching in archive cs.
  1. arXiv:2406.20053

    cs.CR cs.AI cs.CL cs.LG

    Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

    Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

    Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious d…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 22 pages

  2. arXiv:2406.12843

    cs.LG cs.AI stat.ML

    Can Go AIs be adversarially robust?

    Authors: Tom Tseng, Euan McLean, Kellin Pelrine, Tony T. Wang, Adam Gleave

    Abstract: Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 67 pages

  3. arXiv:2312.08793

    cs.LG cs.AI cs.CL cs.CR

    Forbidden Facts: An Investigation of Competing Objectives in Llama-2

    Authors: Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit

    Abstract: LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1…

    Submitted 31 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted to the ATTRIB and SoLaR workshops at NeurIPS 2023; (v3: clarified experimental details)

  4. arXiv:2302.07348

    cs.LG cs.AI stat.ML

    Cliff-Learning

    Authors: Tony T. Wang, Igor Zablotchi, Nir Shavit, Jonathan S. Rosenfeld

    Abstract: We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model clif…

    Submitted 6 June, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: 16 pages; v2 updates: improved layout, added acknowledgements

  5. arXiv:2211.00241

    cs.LG cs.AI cs.CR stat.ML

    Adversarial Policies Beat Superhuman Go AIs

    Authors: Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

    Abstract: We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human exper…

    Submitted 13 July, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: Accepted to ICML 2023, see paper for changelog

    ACM Class: I.2.6

  6. arXiv:2112.10789

    quant-ph cond-mat.quant-gas cond-mat.str-el cs.LG

    Machine learning discovery of new phases in programmable quantum simulator snapshots

    Authors: Cole Miles, Rhine Samajdar, Sepehr Ebadi, Tout T. Wang, Hannes Pichler, Subir Sachdev, Mikhail D. Lukin, Markus Greiner, Kilian Q. Weinberger, Eun-Ah Kim

    Abstract: Machine learning has recently emerged as a promising approach for studying complex phenomena characterized by rich datasets. In particular, data-centric approaches lend to the possibility of automatically discovering structures in experimental datasets that manual inspection may miss. Here, we introduce an interpretable unsupervised-supervised hybrid machine learning approach, the hybrid-correlati…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 9 pages, 5 figures + 12 pages, 10 figures appendix

    Journal ref: Physical Review Research 5, 013026 (2023)