Showing 1–25 of 25 results for author: Barez, F

  1. arXiv:2406.10162  [pdf, other]

    cs.AI cs.CL

    Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

    Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

    Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be to…

    Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

  2. arXiv:2405.08597  [pdf, other]

    cs.LG

    Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Philip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster

    Abstract: Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This reg…

    Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

    Comments: Extension of arXiv:2404.17047

  3. arXiv:2405.06409  [pdf, other]

    cs.LG cs.AI

    Visualizing Neural Network Imagination

    Authors: Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez

    Abstract: In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture with a decoder network at the end. After training, we apply the decoder to the intermediate representations of the network to visualize what they represe…

    Submitted 10 May, 2024; originally announced May 2024.

  4. arXiv:2404.17047  [pdf, other]

    cs.LG

    Near to Mid-term Risks and Opportunities of Open-Source Generative AI

    Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster

    Abstract: In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation i…

    Submitted 24 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Accepted to ICML'24 as a position paper

  5. arXiv:2402.15055  [pdf, other]

    cs.CL cs.AI cs.LG

    Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

    Authors: Clement Neo, Shay B. Cohen, Fazl Barez

    Abstract: In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a pa…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 15 pages, 11 figures

  6. arXiv:2402.02619  [pdf, other]

    cs.LG cs.CL

    Increasing Trust in Language Models through the Reuse of Verified Circuits

    Authors: Philip Quirke, Clement Neo, Fazl Barez

    Abstract: Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this sta…

    Submitted 16 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

    Comments: 8 pages, 4 figures

  7. arXiv:2401.05566  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG cs.SE

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, et al. (14 additional authors not shown)

    Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa…

    Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: updated to add missing acknowledgements

  8. arXiv:2401.01814  [pdf, other]

    cs.AI

    Large Language Models Relearn Removed Concepts

    Authors: Michelle Lo, Shay B. Cohen, Fazl Barez

    Abstract: Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that mod…

    Submitted 3 January, 2024; originally announced January 2024.

  9. arXiv:2312.15241  [pdf, ps, other]

    cs.AI cs.IR

    Measuring Value Alignment

    Authors: Fazl Barez, Philip Torr

    Abstract: As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. This paper introduces a novel formalism to quantify the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We delve into the concept of values as desirable goals tied to actions and no…

    Submitted 23 December, 2023; originally announced December 2023.

    Comments: arXiv admin note: text overlap with arXiv:2110.09240 by other authors

    Journal ref: NeurIPS 2023 MP2 Workshop

  10. arXiv:2311.04131  [pdf, other]

    cs.CL cs.AI cs.LG

    Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

    Authors: Michael Lan, Philip Torr, Fazl Barez

    Abstract: While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which includ…

    Submitted 6 July, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

  11. arXiv:2310.13121  [pdf, other]

    cs.LG cs.AI

    Understanding Addition in Transformers

    Authors: Philip Quirke, Fazl Barez

    Abstract: Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition. Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different…

    Submitted 23 April, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: 9 pages, 8 figures, accepted by ICLR 2024

  12. arXiv:2310.08164  [pdf, other]

    cs.LG

    Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models

    Authors: Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, Fazl Barez

    Abstract: Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced…

    Submitted 7 February, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: 19 pages, 5 figures

  13. arXiv:2310.05876  [pdf]

    cs.AI

    AI Systems of Concern

    Authors: Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh

    Abstract: Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. We label this cluster of characteristics as "Property X". Most present AI systems are low in "Property X"; however, in the absence of deliberate steering, current research directions may rapidly lead to th…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: 9 pages, 1 figure, 2 tables

  14. arXiv:2310.01870  [pdf, other]

    cs.LG

    DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

    Authors: Albert Garde, Esben Kran, Fazl Barez

    Abstract: As large language models (LLMs) become more capable, there is an urgent need for interpretable and transparent tools. Current methods are difficult to implement, and accessible tools to analyze model internals are lacking. To bridge this gap, we present DeepDecipher - an API and interface for probing neurons in transformer models' MLP layers. DeepDecipher makes the outputs of advanced interpretabi…

    Submitted 28 November, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: 5 pages (9 total), 1 figure, submitted to NeurIPS 2023 Workshop XAIA

    MSC Class: 68T50 (Primary) 68T05 (Secondary)

    ACM Class: I.2.7

  15. arXiv:2305.19911  [pdf, other]

    cs.LG cs.CL

    Neuron to Graph: Interpreting Language Model Neurons at Scale

    Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

    Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make th…

    Submitted 31 May, 2023; originally announced May 2023.

  16. arXiv:2305.17553  [pdf, other]

    cs.CL cs.AI cs.LG

    Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

    Authors: Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

    Abstract: Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we e…

    Submitted 3 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: To be published in ACL Findings 2023; for code see https://github.com/apartresearch/specificityplus; for a homepage see https://specificityplus.apartresearch.com/; updated Figures to uniform style

    ACM Class: I.2.7

  17. arXiv:2305.15507  [pdf, other]

    cs.CL cs.AI

    The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

    Authors: Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

    Abstract: Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to prop…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 17 pages, 5 figures, ACL 2023

  18. arXiv:2304.12918  [pdf, other]

    cs.LG

    N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

    Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez

    Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than cu…

    Submitted 22 April, 2023; originally announced April 2023.

    Comments: To be published at ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models

  19. arXiv:2304.11593  [pdf, other]

    cs.LG

    System III: Learning with Domain Knowledge for Safety Constraints

    Authors: Fazl Barez, Hosein Hasanbeig, Alessandro Abbate

    Abstract: Reinforcement learning agents naturally learn from extensive exploration. Exploration is costly and can be unsafe in $\textit{safety-critical}$ domains. This paper proposes a novel framework for incorporating domain knowledge to help guide safe exploration and boost sample efficiency. Previous approaches impose constraints, such as regularisation parameters in neural networks, that rely on large s…

    Submitted 23 April, 2023; originally announced April 2023.

  20. arXiv:2304.09826  [pdf]

    cs.CY cs.AI cs.CL cs.CV cs.LG

    Fairness in AI and Its Long-Term Implications on Society

    Authors: Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez

    Abstract: Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society. However, AI systems have also been shown to harm parts of the population due to biased predictions. AI fairness focuses on mitigating such biases to ensure AI decision making is not discriminatory towards certain groups. We take a closer look at AI fairness a…

    Submitted 19 July, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

    Comments: Stanford Existential Risks Conference 2023

  21. arXiv:2302.13850  [pdf, other]

    q-fin.ST cs.LG

    Exploring the Advantages of Transformers for High-Frequency Trading

    Authors: Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn

    Abstract: This paper explores the novel deep learning Transformers architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to the traditional Long Short-Term Memory models. A hybrid Transformer model, called HFformer, is then introduced for time series forecasting which incorporates a Transformer encoder, linear decoder, spiking activations, and quantile loss function…

    Submitted 20 February, 2023; originally announced February 2023.

  22. arXiv:2301.12561  [pdf, ps, other]

    cs.DB

    Benchmarking Specialized Databases for High-frequency Data

    Authors: Fazl Barez, Paul Bilokon, Ruijie Xiong

    Abstract: This paper presents a benchmarking suite designed for the evaluation and comparison of time series databases for high-frequency data, with a focus on financial applications. The proposed suite comprises four specialized databases: ClickHouse, InfluxDB, kdb+ and TimescaleDB. The results from the suite demonstrate that kdb+ has the highest performance amongst the tested databases, while also high…

    Submitted 29 January, 2023; originally announced January 2023.

    Comments: 29 pages, 9 tables, 11 figures

  23. arXiv:2203.08553  [pdf, other]

    cs.MA cs.AI

    PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

    Authors: Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez

    Abstract: Learning to collaborate is critical in Multi-Agent Reinforcement Learning (MARL). Previous works promote collaboration by maximizing the correlation of agents' behaviors, which is typically characterized by Mutual Information (MI) in different forms. However, we reveal sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder t…

    Submitted 21 February, 2023; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: The paper has been accepted by The Thirty-ninth International Conference on Machine Learning (ICML 2022) and the Cooperative AI Workshop at 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

  24. arXiv:2103.02099  [pdf]

    cs.AI

    Design of an Affordable Prosthetic Arm Equipped with Deep Learning Vision-Based Manipulation

    Authors: Alishba Imran, William Escobar, Freidoon Barez

    Abstract: Many amputees throughout the world are left with limited options to personally own a prosthetic arm due to the expensive cost, mechanical system complexity, and lack of availability. The three main control methods of prosthetic hands are: (1) body-powered control, (2) extrinsic mechanical control, and (3) myoelectric control. These methods can perform well under a controlled situation but will oft…

    Submitted 2 March, 2021; originally announced March 2021.

    Comments: Pre-print paper, 7 pages, 15 figures

  25. arXiv:2010.10993  [pdf]

    physics.soc-ph

    A Model for Optimizing the Health and Economic Impacts of Covid-19 under Social Distancing Measures; A Study for the Number of Passengers and their Seating Arrangements in Aircrafts

    Authors: Elaheh Ghorbani, Hamid Molavian, Fred Barez

    Abstract: Covid-19 has had a disastrous economic impact on countries and industries as countries have gone through the lockdown process to reduce the health impact of Covid-19. As countries have started lifting Covid-19 related restrictions, businesses have been allowed to again have on-site customers. However, just a limited number of people are being allowed on-site as long as social distancing measures a…

    Submitted 25 January, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: 17 pages, 9 figures