Showing 1–6 of 6 results for author: Sachan, K

Searching in archive cs.
  1. arXiv:2402.06782 [pdf, other]

    cs.AI cs.CL

    Debating with More Persuasive LLMs Leads to More Truthful Answers

    Authors: Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

    Abstract: Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this…

    Submitted 30 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: For code please check: https://github.com/ucl-dark/llm_debate

  2. arXiv:2401.05566 [pdf, other]

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa…

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  3. arXiv:2312.06942 [pdf, other]

    cs.LG

    AI Control: Improving Safety Despite Intentional Subversion

    Authors: Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

    Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evalu…

    Submitted 5 January, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Edit: Fix minor typos and clarify abstract

  4. arXiv:2210.01892 [pdf, other]

    cs.NE cs.AI cs.LG

    Polysemanticity and Capacity in Neural Networks

    Authors: Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

    Abstract: Individual neurons in neural networks often represent a mixture of unrelated features. This phenomenon, called polysemanticity, can make interpreting neural networks more difficult and so we aim to understand its causes. We propose doing so through the lens of feature capacity, which is the fractional dimension each feature consumes in the embedding space. We show that in a toy model the op…

    Submitted 11 July, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: 22 pages, 7 figures. Corrected typos in Figure 7, improved notation to distinguish column and row vectors, corrected proof in Appendix A, and other misc changes

  5. arXiv:2205.04685 [pdf, other]

    cs.CR cs.LG

    DNS based In-Browser Cryptojacking Detection

    Authors: Rohit Kumar Sachan, Rachit Agarwal, Sandeep Kumar Shukla

    Abstract: The metadata aspect of Domain Names (DNs) enables us to perform a behavioral study of DNs and detect if a DN is involved in in-browser cryptojacking. Thus, we are motivated to study different temporal and behavioral aspects of DNs involved in cryptojacking. We use temporal features such as query frequency and query burst along with graph-based features such as degree and diameter, and non-temporal…

    Submitted 10 May, 2022; originally announced May 2022.

    Comments: Submitted

  6. arXiv:2106.13420 [pdf, other]

    cs.CR cs.LG

    Identifying malicious accounts in Blockchains using Domain Names and associated temporal properties

    Authors: Rohit Kumar Sachan, Rachit Agarwal, Sandeep Kumar Shukla

    Abstract: The rise in the adoption of blockchain technology has led to increased illegal activities by cyber-criminals costing billions of dollars. Many machine learning algorithms are applied to detect such illegal behavior. These algorithms are often trained on the transaction behavior and, in some cases, trained on the vulnerabilities that exist in the system. In our approach, we study the feasibility of…

    Submitted 25 June, 2021; originally announced June 2021.

    Comments: Submitted to a journal