-
What Makes Two Language Models Think Alike?
Authors:
Jeanne Salle,
Louis Jalouzot,
Nur Lan,
Emmanuel Chemla,
Yair Lakretz
Abstract:
Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous…
▽ More
Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous methods, MLEMs offer a transparent comparison, by identifying the specific linguistic features responsible for similarities and differences. More generally, the method uses formal, symbolic descriptions of a domain, and use these to compare neural representations. As such, the approach can straightforwardly be extended to other domains, such as speech and vision, and to other neural systems, including human brains.
△ Less
Submitted 24 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation
Authors:
Nicolas Guerin,
Shane Steinert-Threlkeld,
Emmanuel Chemla
Abstract:
Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical,…
▽ More
Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show however that even crude semantic signal (similar lexical fields across languages) does improve alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from being analytically guaranteed. Instead, it is another proof that languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
Metric-Learning Encoding Models Identify Processing Profiles of Linguistic Features in BERT's Representations
Authors:
Louis Jalouzot,
Robin Sobczyk,
Bastien Lhopitallier,
Jeanne Salle,
Nur Lan,
Emmanuel Chemla,
Yair Lakretz
Abstract:
We introduce Metric-Learning Encoding Models (MLEMs) as a new approach to understand how neural systems represent the theoretical features of the objects they process. As a proof-of-concept, we apply MLEMs to neural representations extracted from BERT, and track a wide variety of linguistic features (e.g., tense, subject person, clause type, clause embedding). We find that: (1) linguistic features…
▽ More
We introduce Metric-Learning Encoding Models (MLEMs) as a new approach to understand how neural systems represent the theoretical features of the objects they process. As a proof-of-concept, we apply MLEMs to neural representations extracted from BERT, and track a wide variety of linguistic features (e.g., tense, subject person, clause type, clause embedding). We find that: (1) linguistic features are ordered: they separate representations of sentences to different degrees in different layers; (2) neural representations are organized hierarchically: in some layers, we find clusters of representations nested within larger clusters, following successively important linguistic features; (3) linguistic features are disentangled in middle layers: distinct, selective units are activated by distinct linguistic features. Methodologically, MLEMs are superior (4) to multivariate decoding methods, being more robust to type-I errors, and (5) to univariate encoding methods, in being able to predict both local and distributed representations. Together, this demonstrates the utility of Metric-Learning Encoding Methods for studying how linguistic features are neurally encoded in language models and the advantage of MLEMs over traditional methods. MLEMs can be extended to other domains (e.g. vision) and to other neural systems, such as the human brain.
△ Less
Submitted 18 February, 2024;
originally announced February 2024.
-
Bridging the Empirical-Theoretical Gap in Neural Network Formal Language Learning Using Minimum Description Length
Authors:
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
Neural networks offer good approximation to many tasks but consistently fail to reach perfect generalization, even when theoretical work shows that such perfect solutions can be expressed by certain architectures. Using the task of formal language learning, we focus on one simple formal language and show that the theoretically correct solution is in fact not an optimum of commonly used objectives…
▽ More
Neural networks offer good approximation to many tasks but consistently fail to reach perfect generalization, even when theoretical work shows that such perfect solutions can be expressed by certain architectures. Using the task of formal language learning, we focus on one simple formal language and show that the theoretically correct solution is in fact not an optimum of commonly used objectives -- even with regularization techniques that according to common wisdom should lead to simple weights and good generalization (L1, L2) or other meta-heuristics (early-stop**, dropout). On the other hand, replacing standard targets with the Minimum Description Length objective (MDL) results in the correct solution being an optimum.
△ Less
Submitted 6 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Minimum Description Length Hopfield Networks
Authors:
Matan Abudy,
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
Associative memory architectures are designed for memorization but also offer, through their retrieval method, a form of generalization to unseen inputs: stored memories can be seen as prototypes from this point of view. Focusing on Modern Hopfield Networks (MHN), we show that a large memorization capacity undermines the generalization opportunity. We offer a solution to better optimize this trade…
▽ More
Associative memory architectures are designed for memorization but also offer, through their retrieval method, a form of generalization to unseen inputs: stored memories can be seen as prototypes from this point of view. Focusing on Modern Hopfield Networks (MHN), we show that a large memorization capacity undermines the generalization opportunity. We offer a solution to better optimize this tradeoff. It relies on Minimum Description Length (MDL) to determine during training which memories to store, as well as how many of them.
△ Less
Submitted 11 November, 2023;
originally announced November 2023.
-
Benchmarking Neural Network Generalization for Grammar Induction
Authors:
Nur Lan,
Emmanuel Chemla,
Roni Katzir
Abstract:
How well do neural networks generalize? Even for grammar induction tasks, where the target generalization is fully known, previous works have left the question open, testing very limited ranges beyond the training set and using different success criteria. We provide a measure of neural network generalization based on fully specified formal languages. Given a model and a formal grammar, the method…
▽ More
How well do neural networks generalize? Even for grammar induction tasks, where the target generalization is fully known, previous works have left the question open, testing very limited ranges beyond the training set and using different success criteria. We provide a measure of neural network generalization based on fully specified formal languages. Given a model and a formal grammar, the method assigns a generalization score representing how well a model generalizes to unseen samples in inverse relation to the amount of data it was trained on. The benchmark includes languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^mc^{n+m}$, and Dyck-1 and 2. We evaluate selected architectures using the benchmark and find that networks trained with a Minimum Description Length objective (MDL) generalize better and using less data than networks trained using standard loss functions. The benchmark is available at https://github.com/taucompling/bliss.
△ Less
Submitted 25 August, 2023; v1 submitted 16 August, 2023;
originally announced August 2023.
-
Minimum Description Length Recurrent Neural Networks
Authors:
Nur Lan,
Michal Geyer,
Emmanuel Chemla,
Roni Katzir
Abstract:
We train neural networks to optimize a Minimum Description Length score, i.e., to balance between the complexity of the network and its accuracy at a task. We show that networks optimizing this objective function master tasks involving memory challenges and go beyond context-free languages. These learners master languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^{2n}$, $a^nb^mc^{n+m}$, and they perfor…
▽ More
We train neural networks to optimize a Minimum Description Length score, i.e., to balance between the complexity of the network and its accuracy at a task. We show that networks optimizing this objective function master tasks involving memory challenges and go beyond context-free languages. These learners master languages such as $a^nb^n$, $a^nb^nc^n$, $a^nb^{2n}$, $a^nb^mc^{n+m}$, and they perform addition. Moreover, they often do so with 100% accuracy. The networks are small, and their inner workings are transparent. We thus provide formal proofs that their perfect accuracy holds not only on a given test set, but for any input sequence. To our knowledge, no other connectionist model has been shown to capture the underlying grammars for these languages in full generality.
△ Less
Submitted 31 March, 2022; v1 submitted 31 October, 2021;
originally announced November 2021.
-
On the Spontaneous Emergence of Discrete and Compositional Signals
Authors:
Nur Geffen Lan,
Emmanuel Chemla,
Shane Steinert-Threlkeld
Abstract:
We propose a general framework to study language emergence through signaling games with neural agents. Using a continuous latent space, we are able to (i) train using backpropagation, (ii) show that discrete messages nonetheless naturally emerge. We explore whether categorical perception effects follow and show that the messages are not compositional.
We propose a general framework to study language emergence through signaling games with neural agents. Using a continuous latent space, we are able to (i) train using backpropagation, (ii) show that discrete messages nonetheless naturally emerge. We explore whether categorical perception effects follow and show that the messages are not compositional.
△ Less
Submitted 30 April, 2020;
originally announced May 2020.
-
Suszko's Problem: Mixed Consequence and Compositionality
Authors:
Emmanuel Chemla,
Paul Egré
Abstract:
Suszko's problem is the problem of finding the minimal number of truth values needed to semantically characterize a syntactic consequence relation. Suszko proved that every Tarskian consequence relation can be characterized using only two truth values. Malinowski showed that this number can equal three if some of Tarski's structural constraints are relaxed. By so doing, Malinowski introduced a cas…
▽ More
Suszko's problem is the problem of finding the minimal number of truth values needed to semantically characterize a syntactic consequence relation. Suszko proved that every Tarskian consequence relation can be characterized using only two truth values. Malinowski showed that this number can equal three if some of Tarski's structural constraints are relaxed. By so doing, Malinowski introduced a case of so-called mixed consequence, allowing the notion of a designated value to vary between the premises and the conclusions of an argument. In this paper we give a more systematic perspective on Suszko's problem and on mixed consequence. First, we prove general representation theorems relating structural properties of a consequence relation to their semantic interpretation, uncovering the semantic counterpart of substitution-invariance, and establishing that (intersective) mixed consequence is fundamentally the semantic counterpart of the structural property of monotonicity. We use those to derive maximum-rank results proved recently in a different setting by French and Ripley, as well as by Blasio, Marcos and Wansing, for logics with various structural properties (reflexivity, transitivity, none, or both). We strengthen these results into exact rank results for non-permeable logics (roughly, those which distinguish the role of premises and conclusions). We discuss the underlying notion of rank, and the associated reduction proposed independently by Scott and Suszko. As emphasized by Suszko, that reduction fails to preserve compositionality in general, meaning that the resulting semantics is no longer truth-functional. We propose a modification of that notion of reduction, allowing us to prove that over compact logics with what we call regular connectives, rank results are maintained even if we request the preservation of truth-functionality and additional semantic properties.
△ Less
Submitted 9 February, 2019; v1 submitted 25 July, 2017;
originally announced July 2017.