Search | arXiv e-print repository

Debating with More Persuasive LLMs Leads to More Truthful Answers

Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this… ▽ More Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer. We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth. △ Less

Submitted 30 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: For code please check: https://github.com/ucl-dark/llm_debate

arXiv:2401.05566 [pdf, other]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec , et al. (14 additional authors not shown)

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa… ▽ More Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety. △ Less

Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: updated to add missing acknowledgements

arXiv:2312.06942 [pdf, other]

AI Control: Improving Safety Despite Intentional Subversion

Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evalu… ▽ More As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines. △ Less

Submitted 5 January, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: Edit: Fix minor typos and clarify abstract

arXiv:2210.01892 [pdf, other]

Polysemanticity and Capacity in Neural Networks

Authors: Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

Abstract: Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the op… ▽ More Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature \emph{capacity}, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features. Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and more prevalent in some architectures than others. Given an optimal allocation of capacity, we go on to study the geometry of the embedding space. We find a block-semi-orthogonal structure, with differing block sizes in different models, highlighting the impact of model architecture on the interpretability of its neurons. △ Less

Submitted 11 July, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: 22 pages, 7 figures. Corrected typos in Figure 7, improved notation to distinguish column and row vectors, corrected proof in Appendix A, and other misc changes

arXiv:2205.04685 [pdf, other]

DNS based In-Browser Cryptojacking Detection

Authors: Rohit Kumar Sachan, Rachit Agarwal, Sandeep Kumar Shukla

Abstract: The metadata aspect of Domain Names (DNs) enables us to perform a behavioral study of DNs and detect if a DN is involved in in-browser cryptojacking. Thus, we are motivated to study different temporal and behavioral aspects of DNs involved in cryptojacking. We use temporal features such as query frequency and query burst along with graph-based features such as degree and diameter, and non-temporal… ▽ More The metadata aspect of Domain Names (DNs) enables us to perform a behavioral study of DNs and detect if a DN is involved in in-browser cryptojacking. Thus, we are motivated to study different temporal and behavioral aspects of DNs involved in cryptojacking. We use temporal features such as query frequency and query burst along with graph-based features such as degree and diameter, and non-temporal features such as the string-based to detect if a DNs is suspect to be involved in the in-browser cryptojacking. Then, we use them to train the Machine Learning (ML) algorithms over different temporal granularities such as 2 hours datasets and complete dataset. Our results show DecisionTrees classifier performs the best with 59.5% Recall on cryptojacked DN, while for unsupervised learning, K-Means with K=2 perform the best. Similarity analysis of the features reveals a minimal divergence between the cryptojacking DNs and other already known malicious DNs. It also reveals the need for improvements in the feature set of state-of-the-art methods to improve their accuracy in detecting in-browser cryptojacking. As added analysis, our signature-based analysis identifies that none-of-the Indian Government websites were involved in cryptojacking during October-December 2021. However, based on the resource utilization, we identify 10 DNs with different properties than others. △ Less

Submitted 10 May, 2022; originally announced May 2022.

Comments: Submitted

arXiv:2106.13420 [pdf, other]

Identifying malicious accounts in Blockchains using Domain Names and associated temporal properties

Authors: Rohit Kumar Sachan, Rachit Agarwal, Sandeep Kumar Shukla

Abstract: The rise in the adoption of blockchain technology has led to increased illegal activities by cyber-criminals costing billions of dollars. Many machine learning algorithms are applied to detect such illegal behavior. These algorithms are often trained on the transaction behavior and, in some cases, trained on the vulnerabilities that exist in the system. In our approach, we study the feasibility of… ▽ More The rise in the adoption of blockchain technology has led to increased illegal activities by cyber-criminals costing billions of dollars. Many machine learning algorithms are applied to detect such illegal behavior. These algorithms are often trained on the transaction behavior and, in some cases, trained on the vulnerabilities that exist in the system. In our approach, we study the feasibility of using metadata such as Domain Name (DN) associated with the account in the blockchain and identify whether an account should be tagged malicious or not. Here, we leverage the temporal aspects attached to the DNs. Our results identify 144930 DNs that show malicious behavior, and out of these, 54114 DNs show persistent malicious behavior over time. Nonetheless, none of these identified malicious DNs were reported in new officially tagged malicious blockchain DNs. △ Less

Submitted 25 June, 2021; originally announced June 2021.

Comments: Submitted to a journal

Showing 1–6 of 6 results for author: Sachan, K