Search | arXiv e-print repository

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Authors: Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger

Abstract: In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be to… ▽ More In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove. △ Less

Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Make it easier to find samples from the model, and highlight that our operational definition of reward tampering has false positives where the model attempts to complete the task honestly but edits the reward. Add paragraph to conclusion to this effect, and add sentence to figure 1 to this effect

arXiv:2405.08597 [pdf, other]

Risks and Opportunities of Open-Source Generative AI

Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos, Fabro Steibel, Fazel Keshtkar, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Imperial, Juan Arturo Nolazco, Lori Landay, Matthew Jackson, Phillip H. S. Torr, Trevor Darrell, Yong Lee, Jakob Foerster

Abstract: Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This reg… ▽ More Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open-source generative AI. Using a three-stage framework for Gen AI development (near, mid and long-term), we analyze the risks and opportunities of open-source generative AI models with similar capabilities to the ones currently available (near to mid-term) and with greater capabilities (long-term). We argue that, overall, the benefits of open-source Gen AI outweigh its risks. As such, we encourage the open sourcing of models, training and evaluation data, and provide a set of recommendations and best practices for managing risks associated with open-source generative AI. △ Less

Submitted 29 May, 2024; v1 submitted 14 May, 2024; originally announced May 2024.

Comments: Extension of arXiv:2404.17047

arXiv:2405.06409 [pdf, other]

Visualizing Neural Network Imagination

Authors: Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez

Abstract: In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture with a decoder network at the end. After training, we apply the decoder to the intermediate representations of the network to visualize what they represe… ▽ More In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture with a decoder network at the end. After training, we apply the decoder to the intermediate representations of the network to visualize what they represent. We define a quantitative interpretability metric and use it to demonstrate that hidden states can be highly interpretable on a simple task. We also develop autoencoder and adversarial techniques and show that benefit interpretability. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.17047 [pdf, other]

Near to Mid-term Risks and Opportunities of Open-Source Generative AI

Authors: Francisco Eiras, Aleksandar Petrov, Bertie Vidgen, Christian Schroeder de Witt, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Botos Csaba, Fabro Steibel, Fazl Barez, Genevieve Smith, Gianluca Guadagni, Jon Chun, Jordi Cabot, Joseph Marvin Imperial, Juan A. Nolazco-Flores, Lori Landay, Matthew Jackson, Paul Röttger, Philip H. S. Torr, Trevor Darrell, Yong Suk Lee, Jakob Foerster

Abstract: In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation i… ▽ More In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open-source Generative AI. We argue for the responsible open sourcing of generative AI models in the near and medium term. To set the stage, we first introduce an AI openness taxonomy system and apply it to 40 current large language models. We then outline differential benefits and risks of open versus closed source AI and present potential risk mitigation, ranging from best practices to calls for technical and scientific contributions. We hope that this report will add a much needed missing voice to the current public discourse on near to mid-term AI safety and other societal impact. △ Less

Submitted 24 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: Accepted to ICML'24 as a position paper

arXiv:2402.15055 [pdf, other]

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Authors: Clement Neo, Shay B. Cohen, Fazl Barez

Abstract: In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a pa… ▽ More In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate attention mechanisms that activate certain next-token neurons. Our analysis identifies attention heads that recognize contexts relevant to predicting a particular token, activating the associated neuron through the residual connection. We focus specifically on heads in earlier layers consistently activating the same next-token neuron across similar prompts. Exploring these differential activation patterns reveals that heads that specialize for distinct linguistic contexts are tied to generating certain tokens. Overall, our method combines neural explanations and probing isolated components to illuminate how attention enables context-dependent, specialized processing in LLMs. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: 15 pages, 11 figures

arXiv:2402.02619 [pdf, other]

Increasing Trust in Language Models through the Reuse of Verified Circuits

Authors: Philip Quirke, Clement Neo, Fazl Barez

Abstract: Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this sta… ▽ More Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. We show that a model can be trained to meet this standard if built using mathematically and logically specified frameworks. In this paper, we fully verify an auto-regressive transformer model for n-digit integer addition. To exhibit the reusability of verified modules, we insert the trained integer addition model into a larger untrained model and train the combined model to perform both addition and subtraction. We find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. We discuss how inserting verified task modules into LMs can leverage model reuse to improve verifiability and trustworthiness of language models built using them. The reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models. △ Less

Submitted 16 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: 8 pages, 4 figures

arXiv:2401.05566 [pdf, other]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec , et al. (14 additional authors not shown)

Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept exa… ▽ More Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety. △ Less

Submitted 17 January, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: updated to add missing acknowledgements

arXiv:2401.01814 [pdf, other]

Large Language Models Relearn Removed Concepts

Authors: Michelle Lo, Shay B. Cohen, Fazl Barez

Abstract: Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that mod… ▽ More Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model \textit{safety}. Monitoring concept reemergence and develo** techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work strongly demonstrates the resilience and fluidity of concept representations in LLMs post concept removal. △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2312.15241 [pdf, ps, other]

Measuring Value Alignment

Authors: Fazl Barez, Philip Torr

Abstract: As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. This paper introduces a novel formalism to quantify the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We delve into the concept of values as desirable goals tied to actions and no… ▽ More As artificial intelligence (AI) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. This paper introduces a novel formalism to quantify the alignment between AI systems and human values, using Markov Decision Processes (MDPs) as the foundational model. We delve into the concept of values as desirable goals tied to actions and norms as behavioral guidelines, aiming to shed light on how they can be used to guide AI decisions. This framework offers a mechanism to evaluate the degree of alignment between norms and values by assessing preference changes across state transitions in a normative world. By utilizing this formalism, AI developers and ethicists can better design and evaluate AI systems to ensure they operate in harmony with human values. The proposed methodology holds potential for a wide range of applications, from recommendation systems emphasizing well-being to autonomous vehicles prioritizing safety. △ Less

Submitted 23 December, 2023; originally announced December 2023.

Comments: arXiv admin note: text overlap with arXiv:2110.09240 by other authors

Journal ref: NeurIPS 2023 MP2 Workshop

arXiv:2311.04131 [pdf, other]

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

Authors: Michael Lan, Phillip Torr, Fazl Barez

Abstract: While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which includ… ▽ More While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. Overall, documenting shared computational structures enables better model behavior predictions, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models. △ Less

Submitted 6 July, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

arXiv:2310.13121 [pdf, other]

Understanding Addition in Transformers

Authors: Philip Quirke, Fazl Barez

Abstract: Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition. Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different… ▽ More Understanding the inner workings of machine learning models like Transformers is vital for their safe and ethical use. This paper provides a comprehensive analysis of a one-layer Transformer model trained to perform n-digit integer addition. Our findings suggest that the model dissects the task into parallel streams dedicated to individual digits, employing varied algorithms tailored to different positions within the digits. Furthermore, we identify a rare scenario characterized by high loss, which we explain. By thoroughly elucidating the model's algorithm, we provide new insights into its functioning. These findings are validated through rigorous testing and mathematical modeling, thereby contributing to the broader fields of model understanding and interpretability. Our approach opens the door for analyzing more complex tasks and multi-layer Transformer models. △ Less

Submitted 23 April, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: 9 pages, 8 figures, accepted by ICLR 2024

arXiv:2310.08164 [pdf, other]

Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models

Authors: Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, Philip Torr, Fazl Barez

Abstract: Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced… ▽ More Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced them. By fitting a linear function to an LLM's IRM, a reward model with the same type signature as the RLHF reward model is constructed, allowing for direct comparison. Additionally, we validate our construction of the IRM through cross-comparison with classifications of features generated by an LLM based on their relevance to the RLHF reward model. Better comprehending IRMs can help minimize discrepencies between LLM behavior and training objectives, which we believe to be an essential component of the $\textit{safety}$ and $\textit{alignment}$ of LLMs. △ Less

Submitted 7 February, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

Comments: 19 pages, 5 figures

arXiv:2310.05876 [pdf]

AI Systems of Concern

Authors: Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó hÉigeartaigh

Abstract: Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. We label this cluster of characteristics as "Property X". Most present AI systems are low in "Property X"; however, in the absence of deliberate steering, current research directions may rapidly lead to th… ▽ More Concerns around future dangers from advanced AI often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. We label this cluster of characteristics as "Property X". Most present AI systems are low in "Property X"; however, in the absence of deliberate steering, current research directions may rapidly lead to the emergence of highly capable AI systems that are also high in "Property X". We argue that "Property X" characteristics are intrinsically dangerous, and when combined with greater capabilities will result in AI systems for which safety and control is difficult to guarantee. Drawing on several scholars' alternative frameworks for possible AI research trajectories, we argue that most of the proposed benefits of advanced AI can be obtained by systems designed to minimise this property. We then propose indicators and governance interventions to identify and limit the development of systems with risky "Property X" characteristics. △ Less

Submitted 9 October, 2023; originally announced October 2023.

Comments: 9 pages, 1 figure, 2 tables

arXiv:2310.01870 [pdf, other]

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

Authors: Albert Garde, Esben Kran, Fazl Barez

Abstract: As large language models (LLMs) become more capable, there is an urgent need for interpretable and transparent tools. Current methods are difficult to implement, and accessible tools to analyze model internals are lacking. To bridge this gap, we present DeepDecipher - an API and interface for probing neurons in transformer models' MLP layers. DeepDecipher makes the outputs of advanced interpretabi… ▽ More As large language models (LLMs) become more capable, there is an urgent need for interpretable and transparent tools. Current methods are difficult to implement, and accessible tools to analyze model internals are lacking. To bridge this gap, we present DeepDecipher - an API and interface for probing neurons in transformer models' MLP layers. DeepDecipher makes the outputs of advanced interpretability techniques for LLMs readily available. The easy-to-use interface also makes inspecting these complex models more intuitive. This paper outlines DeepDecipher's design and capabilities. We demonstrate how to analyze neurons, compare models, and gain insights into model behavior. For example, we contrast DeepDecipher's functionality with similar tools like Neuroscope and OpenAI's Neuron Explainer. DeepDecipher enables efficient, scalable analysis of LLMs. By granting access to state-of-the-art interpretability methods, DeepDecipher makes LLMs more transparent, trustworthy, and safe. Researchers, engineers, and developers can quickly diagnose issues, audit systems, and advance the field. △ Less

Submitted 28 November, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: 5 pages (9 total), 1 figure, submitted to NeurIPS 2023 Workshop XAIA

MSC Class: 68T50 (Primary) 68T05 (Secondary) ACM Class: I.2.7

arXiv:2305.19911 [pdf, other]

Neuron to Graph: Interpreting Language Model Neurons at Scale

Authors: Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

Abstract: Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make th… ▽ More Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while enriching dataset examples with diverse samples to better encompass the full spectrum of neuron behaviour. These graphs can be visualised to aid researchers' manual interpretation, and can generate token activations on text for automatic validation by comparison with the neuron's ground truth activations, which we use to show that the model is better at predicting neuron activation than two baseline methods. We also demonstrate how the generated graph representations can be flexibly used to facilitate further automation of interpretability research, by searching for neurons with particular properties, or programmatically comparing neurons to each other to identify similar neurons. Our method easily scales to build graph representations for all neurons in a 6-layer Transformer model using a single Tesla T4 GPU, allowing for wide usability. We release the code and instructions for use at https://github.com/alexjfoote/Neuron2Graph. △ Less

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.17553 [pdf, other]

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark

Authors: Jason Hoelscher-Obermaier, Julia Persson, Esben Kran, Ioannis Konstas, Fazl Barez

Abstract: Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we e… ▽ More Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity by a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects. △ Less

Submitted 3 June, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

Comments: To be published in ACL Findings 2023; for code see https://github.com/apartresearch/specificityplus; for a homepage see https://specificityplus.apartresearch.com/; updated Figures to uniform style

ACM Class: I.2.7

arXiv:2305.15507 [pdf, other]

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Authors: Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

Abstract: Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to prop… ▽ More Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability. △ Less

Submitted 24 May, 2023; originally announced May 2023.

Comments: 17 pages, 5 figure, ACL 2023

arXiv:2304.12918 [pdf, other]

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models

Authors: Alex Foote, Neel Nanda, Esben Kran, Ionnis Konstas, Fazl Barez

Abstract: Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than cu… ▽ More Understanding the function of individual neurons within language models is essential for mechanistic interpretability research. We propose $\textbf{Neuron to Graph (N2G)}$, a tool which takes a neuron and its dataset examples, and automatically distills the neuron's behaviour on those examples to an interpretable graph. This presents a less labour intensive approach to interpreting neurons than current manual methods, that will better scale these methods to Large Language Models (LLMs). We use truncation and saliency methods to only present the important tokens, and augment the dataset examples with more diverse samples to better capture the extent of neuron behaviour. These graphs can be visualised to aid manual interpretation by researchers, but can also output token activations on text to compare to the neuron's ground truth activations for automatic validation. N2G represents a step towards scalable interpretability methods by allowing us to convert neurons in an LLM to interpretable representations of measurable quality. △ Less

Submitted 22 April, 2023; originally announced April 2023.

Comments: To be published at ICLR 2023 Workshop on Trustworthy and Reliable Large-Scale Machine Learning Models

arXiv:2304.11593 [pdf, other]

System III: Learning with Domain Knowledge for Safety Constraints

Authors: Fazl Barez, Hosien Hasanbieg, Alesandro Abbate

Abstract: Reinforcement learning agents naturally learn from extensive exploration. Exploration is costly and can be unsafe in $\textit{safety-critical}$ domains. This paper proposes a novel framework for incorporating domain knowledge to help guide safe exploration and boost sample efficiency. Previous approaches impose constraints, such as regularisation parameters in neural networks, that rely on large s… ▽ More Reinforcement learning agents naturally learn from extensive exploration. Exploration is costly and can be unsafe in $\textit{safety-critical}$ domains. This paper proposes a novel framework for incorporating domain knowledge to help guide safe exploration and boost sample efficiency. Previous approaches impose constraints, such as regularisation parameters in neural networks, that rely on large sample sets and often are not suitable for safety-critical domains where agents should almost always avoid unsafe actions. In our approach, called $\textit{System III}$, which is inspired by psychologists' notions of the brain's $\textit{System I}$ and $\textit{System II}$, we represent domain expert knowledge of safety in form of first-order logic. We evaluate the satisfaction of these constraints via p-norms in state vector space. In our formulation, constraints are analogous to hazards, objects, and regions of state that have to be avoided during exploration. We evaluated the effectiveness of the proposed method on OpenAI's Gym and Safety-Gym environments. In all tasks, including classic Control and Safety Games, we show that our approach results in safer exploration and sample efficiency. △ Less

Submitted 23 April, 2023; originally announced April 2023.

arXiv:2304.09826 [pdf]

Fairness in AI and Its Long-Term Implications on Society

Authors: Ondrej Bohdal, Timothy Hospedales, Philip H. S. Torr, Fazl Barez

Abstract: Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society. However, AI systems have also been shown to harm parts of the population due to biased predictions. AI fairness focuses on mitigating such biases to ensure AI decision making is not discriminatory towards certain groups. We take a closer look at AI fairness a… ▽ More Successful deployment of artificial intelligence (AI) in various settings has led to numerous positive outcomes for individuals and society. However, AI systems have also been shown to harm parts of the population due to biased predictions. AI fairness focuses on mitigating such biases to ensure AI decision making is not discriminatory towards certain groups. We take a closer look at AI fairness and analyze how lack of AI fairness can lead to deepening of biases over time and act as a social stressor. More specifically, we discuss how biased models can lead to more negative real-world outcomes for certain groups, which may then become more prevalent by deploying new AI models trained on increasingly biased data, resulting in a feedback loop. If the issues persist, they could be reinforced by interactions with other risks and have severe implications on society in the form of social unrest. We examine current strategies for improving AI fairness, assess their limitations in terms of real-world deployment, and explore potential paths forward to ensure we reap AI's benefits without causing society's collapse. △ Less

Submitted 19 July, 2023; v1 submitted 16 April, 2023; originally announced April 2023.

Comments: Stanford Existential Risks Conference 2023

arXiv:2302.13850 [pdf, other]

Exploring the Advantages of Transformers for High-Frequency Trading

Authors: Fazl Barez, Paul Bilokon, Arthur Gervais, Nikita Lisitsyn

Abstract: This paper explores the novel deep learning Transformers architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to the traditional Long Short-Term Memory models. A hybrid Transformer model, called \textbf{HFformer}, is then introduced for time series forecasting which incorporates a Transformer encoder, linear decoder, spiking activations, and quantile loss function… ▽ More This paper explores the novel deep learning Transformers architectures for high-frequency Bitcoin-USDT log-return forecasting and compares them to the traditional Long Short-Term Memory models. A hybrid Transformer model, called \textbf{HFformer}, is then introduced for time series forecasting which incorporates a Transformer encoder, linear decoder, spiking activations, and quantile loss function, and does not use position encoding. Furthermore, possible high-frequency trading strategies for use with the HFformer model are discussed, including trade sizing, trading signal aggregation, and minimal trading threshold. Ultimately, the performance of the HFformer and Long Short-Term Memory models are assessed and results indicate that the HFformer achieves a higher cumulative PnL than the LSTM when trading with multiple signals during backtesting. △ Less

Submitted 20 February, 2023; originally announced February 2023.

arXiv:2301.12561 [pdf, ps, other]

Benchmarking Specialized Databases for High-frequency Data

Authors: Fazl Barez, Paul Bilokon, Ruijie Xiong

Abstract: This paper presents a benchmarking suite designed for the evaluation and comparison of time series databases for high-frequency data, with a focus on financial applications. The proposed suite comprises of four specialized databases: ClickHouse, InfluxDB, kdb+ and TimescaleDB. The results from the suite demonstrate that kdb+ has the highest performance amongst the tested databases, while also high… ▽ More This paper presents a benchmarking suite designed for the evaluation and comparison of time series databases for high-frequency data, with a focus on financial applications. The proposed suite comprises of four specialized databases: ClickHouse, InfluxDB, kdb+ and TimescaleDB. The results from the suite demonstrate that kdb+ has the highest performance amongst the tested databases, while also highlighting the strengths and weaknesses of each of the databases. The benchmarking suite was designed to provide an objective measure of the performance of these databases as well as to compare their capabilities for different types of data. This provides valuable insights into the suitability of different time series databases for different use cases and provides benchmarks that can be used to inform system design decisions. △ Less

Submitted 29 January, 2023; originally announced January 2023.

Comments: 29 pages, 9 tables, 11 figures

arXiv:2203.08553 [pdf, other]

PMIC: Improving Multi-Agent Reinforcement Learning with Progressive Mutual Information Collaboration

Authors: Pengyi Li, Hongyao Tang, Tianpei Yang, Xiaotian Hao, Tong Sang, Yan Zheng, Jianye Hao, Matthew E. Taylor, Wenyuan Tao, Zhen Wang, Fazl Barez

Abstract: Learning to collaborate is critical in Multi-Agent Reinforcement Learning (MARL). Previous works promote collaboration by maximizing the correlation of agents' behaviors, which is typically characterized by Mutual Information (MI) in different forms. However, we reveal sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder t… ▽ More Learning to collaborate is critical in Multi-Agent Reinforcement Learning (MARL). Previous works promote collaboration by maximizing the correlation of agents' behaviors, which is typically characterized by Mutual Information (MI) in different forms. However, we reveal sub-optimal collaborative behaviors also emerge with strong correlations, and simply maximizing the MI can, surprisingly, hinder the learning towards better collaboration. To address this issue, we propose a novel MARL framework, called Progressive Mutual Information Collaboration (PMIC), for more effective MI-driven collaboration. PMIC uses a new collaboration criterion measured by the MI between global states and joint actions. Based on this criterion, the key idea of PMIC is maximizing the MI associated with superior collaborative behaviors and minimizing the MI associated with inferior ones. The two MI objectives play complementary roles by facilitating better collaborations while avoiding falling into sub-optimal ones. Experiments on a wide range of MARL benchmarks show the superior performance of PMIC compared with other algorithms. △ Less

Submitted 21 February, 2023; v1 submitted 16 March, 2022; originally announced March 2022.

Comments: The paper has been accepted by The Thirty-ninth International Conference on Machine Learning (ICML 2022) and the Cooperative AI Workshop at 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2103.02099 [pdf]

Design of an Affordable Prosthetic Arm Equipped with Deep Learning Vision-Based Manipulation

Authors: Alishba Imran, William Escobar, Freidoon Barez

Abstract: Many amputees throughout the world are left with limited options to personally own a prosthetic arm due to the expensive cost, mechanical system complexity, and lack of availability. The three main control methods of prosthetic hands are: (1) body-powered control, (2) extrinsic mechanical control, and (3) myoelectric control. These methods can perform well under a controlled situation but will oft… ▽ More Many amputees throughout the world are left with limited options to personally own a prosthetic arm due to the expensive cost, mechanical system complexity, and lack of availability. The three main control methods of prosthetic hands are: (1) body-powered control, (2) extrinsic mechanical control, and (3) myoelectric control. These methods can perform well under a controlled situation but will often break down in clinical and everyday use due to poor robustness, weak adaptability, long-term training, and heavy mental burden during use. This paper lays the complete outline of the design process of an affordable and easily accessible novel prosthetic arm that reduces the cost of prosthetics from $10,000 to $700 on average. The 3D printed prosthetic arm is equipped with a depth camera and closed-loop off-policy deep learning algorithm to help form grasps to the object in view. Current work in reinforcement learning masters only individual skills and is heavily focused on parallel jaw grippers for in-hand manipulation. In order to create generalization, which better performs real-world manipulation, the focus is specifically on using the general framework of Markov Decision Process (MDP) through scalable learning with off-policy algorithms such as deep deterministic policy gradient (DDPG) and to study this question in the context of gras** a prosthetic arm. We were able to achieve a 78% grasp success rate on previously unseen objects and generalize across multiple objects for manipulation tasks. This work will make prosthetics cheaper, easier to use and accessible globally for amputees. Future work includes applying similar approaches to other medical assistive devices where a human is interacting with a machine to complete a task. △ Less

Submitted 2 March, 2021; originally announced March 2021.

Comments: Pre-print paper, 7 pages, 15 figures

arXiv:2010.10993 [pdf]

A Model for Optimizing the Health and Economic Impacts of Covid-19 under Social Distancing Measures; A Study for the Number of Passengers and their Seating Arrangements in Aircrafts

Authors: Elaheh Ghorbani, Hamid Molavian, Fred Barez

Abstract: Covid-19 has had a disastrous economic impact on countries and industries as countries have gone through the lockdown process to reduce the health impact of Covid-19. As countries have started lifting Covid-19 related restrictions, businesses have been allowed to again have on-site customers. However, just a limited number of people are being allowed on-site as long as social distancing measures a… ▽ More Covid-19 has had a disastrous economic impact on countries and industries as countries have gone through the lockdown process to reduce the health impact of Covid-19. As countries have started lifting Covid-19 related restrictions, businesses have been allowed to again have on-site customers. However, just a limited number of people are being allowed on-site as long as social distancing measures are being followed. This has resulted in heavy burdens on businesses as their number of customers have decreased substantially. In this study, we propose a model to minimize the economic impact of Covid-19 for businesses that have implemented social distancing measures, as well as to minimize the health impact of Covid-19 for their customers and employees. We introduce the quantity Spread in which minimizing Spread gives the optimum number and arrangement of people at a given site while applying social distancing measures. We apply our model to a real-world scenario and optimize the number of passengers and their arrangements under a social distancing measure for two different popular aircraft seat layouts using the Annealing Monte Carlo technique. We obtain the optimal numbers and optimal arrangements of passengers considering both family groups and individual passengers for the social distancing measure. The obtained optimal arrangements of passengers show complex patterns with groups and individual passengers mixed in complex and non-trivial ways. This demonstrates the necessity of using our model or its variants to find these optimal arrangements. In addition, we show that any other arrangements of passengers with the same number of passengers is a suboptimal arrangement with higher health risks as a result of less distance between passengers. Our model could be implemented for other social situations such as sports events, theaters, medical centers, etc. △ Less

Submitted 25 January, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: 17 pages, 9 figures

Showing 1–25 of 25 results for author: Barez, F