Search | arXiv e-print repository

arXiv:2406.20053 [pdf, other]

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Authors: Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt

Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious d… ▽ More Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: 22 pages

arXiv:2406.12843 [pdf, other]

Can Go AIs be adversarially robust?

Authors: Tom Tseng, Euan McLean, Kellin Pelrine, Tony T. Wang, Adam Gleave

Abstract: Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect… ▽ More Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect against previously discovered attacks. Unfortunately, we also find that none of these defenses are able to withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 67 pages

arXiv:2312.08793 [pdf, other]

Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Authors: Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit

Abstract: LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1… ▽ More LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. We find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call The California Attack. Our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ML systems. Project website available at https://forbiddenfacts.github.io . △ Less

Submitted 31 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to the ATTRIB and SoLaR workshops at NeurIPS 2023; (v3: clarified experimental details)

arXiv:2302.07348 [pdf, other]

Cliff-Learning

Authors: Tony T. Wang, Igor Zablotchi, Nir Shavit, Jonathan S. Rosenfeld

Abstract: We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model clif… ▽ More We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon. We observe that the degree of cliff-learning reflects the degree of compatibility between the priors of a learning algorithm and the task being learned. △ Less

Submitted 6 June, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: 16 pages; v2 updates: improved layout, added acknowledgements

arXiv:2211.00241 [pdf, other]

Adversarial Policies Beat Superhuman Go AIs

Authors: Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

Abstract: We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human exper… ▽ More We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available https://goattack.far.ai/. △ Less

Submitted 13 July, 2023; v1 submitted 31 October, 2022; originally announced November 2022.

Comments: Accepted to ICML 2023, see paper for changelog

ACM Class: I.2.6

arXiv:2112.10789 [pdf, other]

doi 10.1103/PhysRevResearch.5.013026

Machine learning discovery of new phases in programmable quantum simulator snapshots

Authors: Cole Miles, Rhine Samajdar, Sepehr Ebadi, Tout T. Wang, Hannes Pichler, Subir Sachdev, Mikhail D. Lukin, Markus Greiner, Kilian Q. Weinberger, Eun-Ah Kim

Abstract: Machine learning has recently emerged as a promising approach for studying complex phenomena characterized by rich datasets. In particular, data-centric approaches lend to the possibility of automatically discovering structures in experimental datasets that manual inspection may miss. Here, we introduce an interpretable unsupervised-supervised hybrid machine learning approach, the hybrid-correlati… ▽ More Machine learning has recently emerged as a promising approach for studying complex phenomena characterized by rich datasets. In particular, data-centric approaches lend to the possibility of automatically discovering structures in experimental datasets that manual inspection may miss. Here, we introduce an interpretable unsupervised-supervised hybrid machine learning approach, the hybrid-correlation convolutional neural network (Hybrid-CCNN), and apply it to experimental data generated using a programmable quantum simulator based on Rydberg atom arrays. Specifically, we apply Hybrid-CCNN to analyze new quantum phases on square lattices with programmable interactions. The initial unsupervised dimensionality reduction and clustering stage first reveals five distinct quantum phase regions. In a second supervised stage, we refine these phase boundaries and characterize each phase by training fully interpretable CCNNs and extracting the relevant correlations for each phase. The characteristic spatial weightings and snippets of correlations specifically recognized in each phase capture quantum fluctuations in the striated phase and identify two previously undetected phases, the rhombic and boundary-ordered phases. These observations demonstrate that a combination of programmable quantum simulators with machine learning can be used as a powerful tool for detailed exploration of correlated quantum states of matter. △ Less

Submitted 20 December, 2021; originally announced December 2021.

Comments: 9 pages, 5 figures + 12 pages, 10 figures appendix

Journal ref: Physics Review Research 5, 013026 (2023)

Showing 1–6 of 6 results for author: Wang, T T