Showing 1–50 of 87 results for author: Belinkov, Y

  1. arXiv:2406.16254  [pdf, other]

    cs.LG cs.AI cs.CL

    Confidence Regulation Neurons in Language Models

    Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

    Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized b…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 25 pages, 14 figures

  2. arXiv:2406.09325  [pdf, other]

    cs.CL

    REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

    Authors: Tomer Ashuach, Martin Tutek, Yonatan Belinkov

    Abstract: Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel model editing me…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 18 pages, 3 figures

    ACM Class: I.2.7

  3. arXiv:2405.07788  [pdf, other]

    cs.CL

    DEPTH: Discourse Education through Pre-Training Hierarchically

    Authors: Zachary Bamberger, Ofek Glick, Chaim Baskin, Yonatan Belinkov

    Abstract: Language Models (LMs) often struggle with linguistic understanding at the discourse level, even though discourse patterns such as coherence, cohesion, and narrative flow are prevalent in their pre-training data. Current methods address these challenges only after the pre-training phase, relying on expensive human annotated data to align the model. To improve the discourse capabilities of LMs alrea…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: 28 pages, 10 figures, 8 tables

  4. arXiv:2404.09971  [pdf, other]

    cs.CL

    Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs

    Authors: Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov

    Abstract: Large language models (LLMs) are susceptible to hallucination, which sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's computation during generation, using different setups and heuristics. Those works lack separation between different hallucination causes. In this work, we first introduce an approach for constructi…

    Submitted 15 April, 2024; originally announced April 2024.

    ACM Class: I.2.7

  5. arXiv:2403.19887  [pdf, other]

    cs.CL cs.LG

    Jamba: A Hybrid Transformer-Mamba Language Model

    Authors: Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

    Abstract: We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows reso…

    Submitted 3 July, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Webpage: https://www.ai21.com/jamba

  6. arXiv:2403.19647  [pdf, other]

    cs.LG cs.AI cs.CL

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Authors: Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller

    Abstract: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse featur…

    Submitted 31 March, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: Code and data at https://github.com/saprmarks/feature-circuits. Demonstration at https://feature-circuits.xyz

  7. arXiv:2403.17806  [pdf, other]

    cs.LG cs.CL

    Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

    Authors: Michael Hanna, Sandro Pezzelle, Yonatan Belinkov

    Abstract: Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in a LM's circuit by performing causal interventions on each edge independently, but this scales poorly with model size. Edge attribution patching (EAP),…

    Submitted 26 March, 2024; originally announced March 2024.

  8. arXiv:2403.14705  [pdf, other]

    cs.AI cs.CL

    Concept-Best-Matching: Evaluating Compositionality in Emergent Communication

    Authors: Boaz Carmeli, Yonatan Belinkov, Ron Meir

    Abstract: Artificial agents that learn to communicate in order to accomplish a given task acquire communication protocols that are typically opaque to a human. A large body of work has attempted to evaluate the emergent communication via various evaluation measures, with compositionality featuring as a prominent desired trait. However, current evaluation procedures do not directly expose the composit…

    Submitted 17 March, 2024; originally announced March 2024.

  9. arXiv:2403.09516  [pdf, other]

    cs.CL cs.CY cs.LG

    Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information

    Authors: Shadi Iskander, Kira Radinsky, Yonatan Belinkov

    Abstract: Mitigating social biases typically requires identifying the social groups associated with each data sample. In this paper, we present DAFair, a novel approach to address social bias in language models. Unlike traditional methods that rely on explicit demographic labels, our approach does not require any such information. Instead, we leverage predefined prototypical demographic texts and incorporat…

    Submitted 5 April, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

  10. arXiv:2403.05846  [pdf, other]

    cs.CV cs.CL

    Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines

    Authors: Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov

    Abstract: Text-to-image diffusion models (T2I) use a latent representation of a text prompt to guide the image generation process. However, the process by which the encoder produces the text representation is unknown. We propose the Diffusion Lens, a method for analyzing the text encoder of T2I models by generating images from its intermediate representations. Using the Diffusion Lens, we perform an extensi…

    Submitted 9 March, 2024; originally announced March 2024.

    Comments: Project webpage: tokeron.github.io/DiffusionLensWeb

    ACM Class: I.2.7; I.4.0

  11. arXiv:2402.17371  [pdf, other]

    cs.CL

    A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry

    Authors: Michael Toker, Oren Mishali, Ophir Münz-Manor, Benny Kimelfeld, Yonatan Belinkov

    Abstract: There is a large volume of late antique and medieval Hebrew texts. They represent a crucial linguistic and cultural bridge between Biblical and modern Hebrew. Poetry is prominent in these texts and one of its main characteristics is the frequent use of metaphor. Distinguishing figurative and literal language use is a major task for scholars of the Humanities, especially in the fields of literature,…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: EACL 2024. Project webpage: https://tokeron.github.io/metaphor/

    ACM Class: I.2.7

  12. arXiv:2402.14811  [pdf, other]

    cs.CL cs.LG

    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

    Authors: Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau

    Abstract: Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, w…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: ICLR 2024. 26 pages, 13 figures. Code and data at https://finetuning.baulab.info/

  13. arXiv:2402.12865  [pdf, other]

    cs.CL cs.AI cs.LG

    Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

    Authors: Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf

    Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first p…

    Submitted 20 February, 2024; originally announced February 2024.

  14. arXiv:2312.07991  [pdf, other]

    cs.LG

    Accelerating the Global Aggregation of Local Explanations

    Authors: Alon Mor, Yonatan Belinkov, Benny Kimelfeld

    Abstract: Local explanation methods highlight the input tokens that have a considerable impact on the outcome of classifying the document at hand. For example, the Anchor algorithm applies a statistical analysis of the sensitivity of the classifier to changes in the token. Aggregating local explanations over a dataset provides a global explanation of the model. Such aggregation aims to detect words with the…

    Submitted 12 January, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  15. arXiv:2310.15004  [pdf, other]

    cs.CL

    When Language Models Fall in Love: Animacy Processing in Transformer Language Models

    Authors: Michael Hanna, Yonatan Belinkov, Sandro Pezzelle

    Abstract: Animacy - whether an entity is alive and sentient - is fundamental to cognitive processing, impacting areas such as memory, vision, and language. However, animacy is not always expressed directly in language: in English it often manifests indirectly, in the form of selectional constraints on verbs and adjectives. This poses a potential issue for transformer language models (LMs): they often train…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: To appear at EMNLP 2023

  16. arXiv:2308.14761  [pdf, other]

    cs.CV cs.LG

    Unified Concept Editing in Diffusion Models

    Authors: Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

    Abstract: Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method,…

    Submitted 25 August, 2023; originally announced August 2023.

  17. arXiv:2308.09124  [pdf, other]

    cs.CL

    Linearity of Relation Decoding in Transformer Language Models

    Authors: Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau

    Abstract: Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a fir…

    Submitted 15 February, 2024; v1 submitted 17 August, 2023; originally announced August 2023.

  18. arXiv:2308.00225  [pdf, other]

    cs.AI cs.CY cs.LG

    Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias

    Authors: Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov

    Abstract: Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. While these tuning methods can help align models with human objectives and generate high-quality text, not much is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision mak…

    Submitted 31 March, 2024; v1 submitted 31 July, 2023; originally announced August 2023.

    Comments: TACL 2024. Presented at ACL 2024. 12 pages

  19. arXiv:2307.06908  [pdf, other]

    cs.CL cs.AI

    Generating Benchmarks for Factuality Evaluation of Language Models

    Authors: Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham

    Abstract: Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing methods for factuality evaluation of LLM generation focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent domain specific or rare facts. We propose FACTOR: Factual…

    Submitted 4 February, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

  20. arXiv:2306.00738  [pdf, other]

    cs.CL cs.CV

    ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

    Authors: Dana Arad, Hadas Orgad, Yonatan Belinkov

    Abstract: Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to text-to-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-i…

    Submitted 7 May, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to NAACL 2024 (Main Conference)

    MSC Class: 68T50; ACM Class: I.2.7

  21. arXiv:2305.15054  [pdf, other]

    cs.CL cs.LG

    A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis

    Authors: Alessandro Stolfo, Yonatan Belinkov, Mrinmaya Sachan

    Abstract: Mathematical reasoning in large language models (LMs) has garnered significant attention in recent work, but there is a limited understanding of how these models process and store information related to arithmetic tasks within their architecture. In order to improve our understanding of this aspect of language models, we present a mechanistic interpretation of Transformer-based LMs on arithmetic q…

    Submitted 20 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023. 18 pages, 19 figures

  22. arXiv:2305.13417  [pdf, other]

    cs.CL

    VISIT: Visualizing and Interpreting the Semantic Information Flow of Transformers

    Authors: Shahar Katz, Yonatan Belinkov

    Abstract: Recent advances in interpretability suggest we can project weights and hidden states of transformer-based language models (LMs) to their vocabulary, a transformation that makes them more human interpretable. In this paper, we investigate LM attention heads and memory values, the vectors the models dynamically create and recall while processing a given input. By analyzing the tokens they represent…

    Submitted 24 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: EMNLP Findings 2023

    MSC Class: 68T50; ACM Class: I.2.7

  23. arXiv:2305.10204  [pdf, other]

    cs.CL cs.AI

    Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection

    Authors: Shadi Iskander, Kira Radinsky, Yonatan Belinkov

    Abstract: Natural language processing models tend to learn and encode social biases present in the data. One popular approach for addressing such biases is to eliminate encoded information from the model's representations. However, current methods are restricted to removing only linearly encoded information. In this work, we propose Iterative Gradient-Based Projection (IGBP), a novel method for removing non…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: This paper will be published in the proceedings of Findings of ACL 2023

  24. arXiv:2303.16992  [pdf, other]

    cs.CL

    ContraSim -- A Similarity Measure Based on Contrastive Learning

    Authors: Adir Rahamim, Yonatan Belinkov

    Abstract: Recent work has compared neural network representations via similarity-based analyses to improve model interpretation. The quality of a similarity measure is typically evaluated by its success in assigning a high score to representations that are expected to be matched. However, existing similarity measures perform mediocrely on standard benchmarks. In this work, we develop a new similarity measur…

    Submitted 30 May, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

  25. arXiv:2303.08084  [pdf, other]

    cs.CV

    Editing Implicit Assumptions in Text-to-Image Diffusion Models

    Authors: Hadas Orgad, Bahjat Kawar, Yonatan Belinkov

    Abstract: Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edi…

    Submitted 25 August, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: Project page: https://time-diffusion.github.io/

  26. arXiv:2212.10947  [pdf, other]

    cs.CL

    Parallel Context Windows for Large Language Models

    Authors: Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham

    Abstract: When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures, and cannot be easily applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The k…

    Submitted 1 August, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: The 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)

  27. arXiv:2212.10563  [pdf, other]

    cs.CL

    BLIND: Bias Removal With No Demographics

    Authors: Hadas Orgad, Yonatan Belinkov

    Abstract: Models trained on real-world data tend to imitate and amplify social biases. Common methods to mitigate biases require prior information on the types of biases that should be mitigated (e.g., gender or racial bias) and the social groups associated with each data sample. In this work, we introduce BLIND, a method for bias removal with no prior knowledge of the demographics in the dataset. While tra…

    Submitted 11 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Accepted to ACL 2023 main conference

    MSC Class: 68T50; ACM Class: I.2.7

  28. arXiv:2212.10380  [pdf, other]

    cs.CL cs.IR

    What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary

    Authors: Ori Ram, Liat Bezalel, Adi Zicher, Yonatan Belinkov, Jonathan Berant, Amir Globerson

    Abstract: Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that t…

    Submitted 24 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023

  29. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  30. arXiv:2211.02412  [pdf, other]

    cs.AI cs.MA

    Emergent Quantized Communication

    Authors: Boaz Carmeli, Ron Meir, Yonatan Belinkov

    Abstract: The field of emergent communication aims to understand the characteristics of communication as it emerges from artificial agents solving tasks that require information exchange. Communication with discrete messages is considered a desired characteristic, for both scientific and applied reasons. However, training a multi-agent system with discrete communication is not straightforward, requiring eit…

    Submitted 19 January, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    MSC Class: 68T07; ACM Class: I.2.6

  31. arXiv:2210.11471  [pdf, other]

    cs.CL

    Choose Your Lenses: Flaws in Gender Bias Evaluation

    Authors: Hadas Orgad, Yonatan Belinkov

    Abstract: Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used in this vein. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics that measure how a model's performance on some task is…

    Submitted 20 October, 2022; originally announced October 2022.

    Comments: Accepted to the 4th Workshop on Gender Bias in Natural Language Processing

    MSC Class: 68T50; ACM Class: I.2.7

  32. arXiv:2210.09404  [pdf, other]

    cs.LG cs.IT

    Measures of Information Reflect Memorization Patterns

    Authors: Rachit Bansal, Danish Pruthi, Yonatan Belinkov

    Abstract: Neural networks are known to exploit spurious artifacts (or shortcuts) that co-occur with a target label, exhibiting heuristic memorization. On the other hand, networks have been shown to memorize training examples, resulting in example-level memorization. These kinds of memorization impede generalization of networks beyond their training distributions. Detecting such memorization could be challen…

    Submitted 1 February, 2024; v1 submitted 17 October, 2022; originally announced October 2022.

    Comments: 22 pages; NeurIPS 2022. Code and data at https://rachitbansal.github.io/information-measures

  33. arXiv:2210.07229  [pdf, other]

    cs.CL cs.LG

    Mass-Editing Memory in a Transformer

    Authors: Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, David Bau

    Abstract: Recent work has shown exciting promise in updating large language models with new memories, so as to replace obsolete information or add specialized knowledge. However, this line of work is predominantly limited to updating single associations. We develop MEMIT, a method for directly updating a language model with many memories, demonstrating experimentally that it can scale up to thousands of ass…

    Submitted 1 August, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: 18 pages, 11 figures. Code and data at https://memit.baulab.info

  34. arXiv:2207.14251  [pdf, other]

    cs.CL

    Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions

    Authors: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, Yoav Goldberg

    Abstract: Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain exp…

    Submitted 24 March, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: We received a criticism regarding the validity of the causal formulation in this paper. We will address them in an upcoming version

  35. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  36. arXiv:2206.00259  [pdf, other]

    cs.CL cs.AI cs.LG

    IDANI: Inference-time Domain Adaptation via Neuron-level Interventions

    Authors: Omer Antverg, Eyal Ben-David, Yonatan Belinkov

    Abstract: Large pre-trained models are usually fine-tuned on downstream task data, and tested on unseen data. When the train and test data come from different domains, the model is likely to struggle, as it is not adapted to the test domain. We propose a new approach for domain adaptation (DA), using neuron-level interventions: We modify the representation of each test example in specific neurons, resulting…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: Our code is available at https://github.com/technion-cs-nlp/idani

  37. arXiv:2205.00445  [pdf, other]

    cs.CL cs.AI

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    Authors: Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, Moshe Tenenholtz

    Abstract: Huge language models (LMs) have ushered in a new era for AI, serving as a gateway to natural-language-based knowledge tasks. Although an essential element of modern AI, LMs are also inherently limited in a number of ways. We discuss these limitations and how they can be avoided by adopting a systems approach. Conceptualizing the challenge as one that involves knowledge and reasoning in addition to…

    Submitted 1 May, 2022; originally announced May 2022.

  38. arXiv:2204.06827  [pdf, other]

    cs.CL

    How Gender Debiasing Affects Internal Model Representations, and Why It Matters

    Authors: Hadas Orgad, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

    Abstract: Common studies of gender bias in NLP focus either on extrinsic bias measured by model performance on a downstream task or on intrinsic bias found in models' internal representations. However, the relationship between extrinsic and intrinsic bias is relatively unknown. In this work, we illuminate this relationship by measuring both quantities together: we debias a model during downstream fine-tunin…

    Submitted 16 May, 2022; v1 submitted 14 April, 2022; originally announced April 2022.

    Comments: Accepted to NAACL 2022

    MSC Class: 68T50; ACM Class: I.2.7

  39. arXiv:2204.05428  [pdf, other]

    cs.CL cs.AI

    A Multilingual Perspective Towards the Evaluation of Attribution Methods in Natural Language Inference

    Authors: Kerem Zaman, Yonatan Belinkov

    Abstract: Most evaluations of attribution methods focus on the English language. In this work, we present a multilingual approach for evaluating attribution methods for the Natural Language Inference (NLI) task in terms of faithfulness and plausibility. First, we introduce a novel cross-lingual strategy to measure faithfulness based on word alignments, which eliminates the drawbacks of erasure-based evaluat…

    Submitted 4 June, 2023; v1 submitted 11 April, 2022; originally announced April 2022.

    Comments: 21 pages, 7 figures. Code and data at https://keremzaman.com/explaiNLI/; Published in the Proceedings of EMNLP 2022

    Journal ref: https://aclanthology.org/2022.emnlp-main.101/

  40. arXiv:2202.05262  [pdf, other]

    cs.CL cs.LG

    Locating and Editing Factual Associations in GPT

    Authors: Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov

    Abstract: We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modul…

    Submitted 13 January, 2023; v1 submitted 10 February, 2022; originally announced February 2022.

    Comments: NeurIPS 2022. 35 pages, 30 figures. Code and data at https://rome.baulab.info/

    ACM Class: I.2.7

  41. arXiv:2110.07483  [pdf, other]

    cs.CL

    On the Pitfalls of Analyzing Individual Neurons in Language Models

    Authors: Omer Antverg, Yonatan Belinkov

    Abstract: While many studies have shown that linguistic information is encoded in hidden word representations, few have studied individual neurons, to show how and in which neurons it is encoded. Among these, the common approach is to use an external probe to rank neurons according to their relevance to some linguistic attribute, and to evaluate the obtained ranking using the same probe that produced it. We…

    Submitted 1 August, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: ICLR 2022 Main Conference

    ACM Class: I.2.7

  42. arXiv:2109.04095  [pdf, other]

    cs.CL

    Debiasing Methods in Natural Language Understanding Make Bias More Accessible

    Authors: Michael Mendelson, Yonatan Belinkov

    Abstract: Model robustness to bias is often determined by the generalization on carefully designed out-of-distribution datasets. Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions. An underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP 2021

    MSC Class: 68T50 ACM Class: I.2.7

  43. arXiv:2108.14006  [pdf, other]

    cs.CL

    A Generative Approach for Mitigating Structural Biases in Natural Language Inference

    Authors: Dimion Asael, Zachary Ziegler, Yonatan Belinkov

    Abstract: Many natural language inference (NLI) datasets contain biases that allow models to perform well by only using a biased subset of the input, without considering the remaining features. For instance, models are able to make a classification decision by only using the hypothesis, without learning the true relationship between it and the premise. These structural biases lead discriminative models to l…

    Submitted 31 August, 2021; originally announced August 2021.

    ACM Class: I.2.7

  44. arXiv:2106.06087  [pdf, other]

    cs.CL

    Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

    Authors: Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, Yonatan Belinkov

    Abstract: Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models' preferences for grammatical inflections, as well as whether ne…

    Submitted 22 June, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-IJCNLP 2021

    MSC Class: 68T50 ACM Class: I.2.7

  45. arXiv:2106.05469  [pdf, other]

    cs.CL

    Variational Information Bottleneck for Effective Low-Resource Fine-Tuning

    Authors: Rabeeh Karimi Mahabadi, Yonatan Belinkov, James Henderson

    Abstract: While large-scale pretrained language models have obtained impressive results when fine-tuned on a wide variety of tasks, they still often suffer from overfitting in low-resource scenarios. Since such models are general-purpose feature extractors, many of these features are inevitably irrelevant for a given target task. We propose to use Variational Information Bottleneck (VIB) to suppress irrelev…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: ICLR, 2021

  46. arXiv:2104.08142  [pdf, other]

    cs.CL cs.LG

    Supervising Model Attention with Human Explanations for Robust Natural Language Inference

    Authors: Joe Stacey, Yonatan Belinkov, Marek Rei

    Abstract: Natural Language Inference (NLI) models are known to learn from biases and artefacts within their training data, impacting how well they generalise to other unseen datasets. Existing de-biasing approaches focus on preventing the models from learning these biases, which can result in restrictive models and lower performance. We instead investigate teaching the model how a human would approach the N…

    Submitted 1 May, 2022; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted at AAAI 2022

  47. arXiv:2102.12452  [pdf, ps, other]

    cs.CL

    Probing Classifiers: Promises, Shortcomings, and Advances

    Authors: Yonatan Belinkov

    Abstract: Probing classifiers have emerged as one of the prominent methodologies for interpreting and analyzing deep neural network models of natural language processing. The basic idea is simple -- a classifier is trained to predict some linguistic property from a model's representations -- and has been used to examine a wide variety of models and properties. However, recent studies have demonstrated vario…

    Submitted 22 September, 2021; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: Accepted to Computational Linguistics as a squib

    ACM Class: I.2.7

  48. arXiv:2012.01300  [pdf, other]

    cs.CL cs.LG

    Learning from others' mistakes: Avoiding dataset biases without modeling them

    Authors: Victor Sanh, Thomas Wolf, Yonatan Belinkov, Alexander M. Rush

    Abstract: State-of-the-art natural language processing (NLP) models often learn to model dataset biases and surface form correlations instead of features that target the intended underlying task. Previous work has demonstrated effective methods to circumvent these issues when knowledge of the bias is available. We consider cases where the bias issues may not be explicitly identified, and show a method for t…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: 15 pages, 6 figures, 6 tables

  49. arXiv:2010.11481  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Similarity Analysis of Self-Supervised Speech Representations

    Authors: Yu-An Chung, Yonatan Belinkov, James Glass

    Abstract: Self-supervised speech representation learning has recently been a prosperous research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work,…

    Submitted 2 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021. Supplementary materials available at https://github.com/iamyuanchung/ICASSP21-Similarity-Supplementary

  50. arXiv:2010.02695  [pdf, other]

    cs.CL

    Analyzing Individual Neurons in Pre-trained Language Models

    Authors: Nadir Durrani, Hassan Sajjad, Fahim Dalvi, Yonatan Belinkov

    Abstract: While a lot of analysis has been carried out to demonstrate linguistic knowledge captured by the representations learned within deep NLP models, very little attention has been paid towards individual neurons. We carry out a neuron-level analysis using core linguistic tasks of predicting morphology, syntax and semantics, on pre-trained language models, with questions like: i) do individual neurons in pre…

    Submitted 6 October, 2020; originally announced October 2020.

    Comments: Accepted in EMNLP 2020