-
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations
Authors:
Matthias Lindemann,
Alexander Koller,
Ivan Titov
Abstract:
Models need appropriate inductive biases to effectively learn from small amounts of data and generalize systematically outside of the training distribution. While Transformers are highly versatile and powerful, they can still benefit from enhanced structural inductive biases for seq2seq tasks, especially those involving syntactic transformations, such as converting active to passive voice or semantic parsing. In this paper, we propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training to perform synthetically generated syntactic transformations of dependency trees given a description of the transformation. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking, and also improves structural generalization for semantic parsing. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token, and that the model can leverage these attention heads on downstream tasks.
Submitted 5 July, 2024;
originally announced July 2024.
-
Magnetic microstructure of nanocrystalline Fe-Nb-B alloys as seen by small-angle neutron and X-ray scattering
Authors:
Venus Rai,
Ivan Titov,
Michael P. Adams,
Kiyonori Suzuki,
Joachim Kohlbrecher,
Andreas Michels
Abstract:
We have investigated the magnetic microstructure of two-phase Fe-Nb-B based Nanoperm alloys using unpolarized small-angle neutron scattering (SANS) and small-angle X-ray scattering (SAXS). Our SANS analysis reveals a significantly large magnetic scattering contribution due to spin misalignment, primarily originating from the substantial jump in the longitudinal magnetization at the interfaces between the particles and the matrix. The magnetic scattering exhibits an angular anisotropy that resembles a clover-leaf-type pattern, consistent with the predictions of micromagnetic SANS theory. Analysis of the one-dimensional SANS data yields values for the exchange-stiffness constant and the average anisotropy and magnetostatic fields. The micromagnetic correlation lengths for all three samples exhibit a similar field variation with sizes ranging between about 10-35 nm. We also find that the nuclear and magnetic residual scattering component of the SANS cross section exhibits a similar $q$ dependency as the SAXS data. These findings further validate the applicability of the micromagnetic SANS theory, and the mesoscopic information obtained is important for the advancement of the soft magnetic properties of this class of material.
Submitted 23 May, 2024;
originally announced May 2024.
-
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection
Authors:
Guillem Ramírez,
Alexandra Birch,
Ivan Titov
Abstract:
Researchers and practitioners operating on a limited budget face the cost-performance trade-off dilemma. The challenging decision often centers on whether to use a large LLM with better performance or a smaller one with reduced costs. This has motivated recent research in the optimisation of LLM calls. Either a cascading strategy is used, where a smaller LLM or both are called sequentially, or a routing strategy is used, where only one model is ever called. Both scenarios are dependent on a decision criterion which is typically implemented by an extra neural model. In this work, we propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion. We compare our approach with both cascading and routing strategies using three different pairs of pre-trained small and large LLMs, on nine different tasks and against approaches that require an additional neural model. Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
Submitted 3 May, 2024;
originally announced May 2024.
-
Unlearning Traces the Influential Training Data of Language Models
Authors:
Masaru Isonuma,
Ivan Titov
Abstract:
Identifying the training datasets that influence a language model's outputs is essential for minimizing the generation of harmful content and enhancing its performance. Ideally, we can measure the influence of each dataset by removing it from training; however, it is prohibitively expensive to retrain a model multiple times. This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance. UnTrac is extremely simple; each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. Furthermore, we propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets. UnTrac-Inv resembles UnTrac, while being efficient for massive training datasets. In the experiments, we examine if our methods can assess the influence of pretraining datasets on generating toxic, biased, and untruthful content. Our methods estimate their influence much more accurately than existing methods while requiring neither excessive memory space nor multiple checkpoints.
Submitted 13 June, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Optical and thermal effects in the neighborhood of the spherical layered nanoparticle of the "metallic core -- J-aggregate shell" structure
Authors:
A. V. Korotun,
N. A. Smirnova,
V. I. Reva,
I. M. Titov,
G. M. Shilo
Abstract:
The relations for the polarizability of the metallic nanoparticles, coated with the shell of cyanine dyes, are obtained in the article. The frequency dependencies for light absorption and scattering efficiencies, the heating of the composite nanoparticle and the electric field amplification in its neighborhood are studied. It is established that all the dependencies have three maxima which correspond to the frequencies of hybrid plasmon-exciton resonance. It is shown that an increase in content of metal in the nanoparticle causes a blue shift of the maxima from the visible part of the spectrum and a red shift of the maximum from ultraviolet frequency range. The issue of application of metal-organic nanoparticles in nanomedicine, in particular for the photothermal therapy of malignant neoplasms is studied.
Submitted 17 January, 2024;
originally announced January 2024.
-
Compositional Generalization for Data-to-Text Generation
Authors:
Xinnuo Xu,
Ivan Titov,
Mirella Lapata
Abstract:
Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g. hallucinations or omissions). We refer to this issue as compositional generalisation, and it encouraged us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. This approach significantly outperforms T5 baselines across all evaluation metrics. Notably, it achieved a 31% improvement over T5 in terms of a metric focused on maintaining faithfulness to the input.
Submitted 5 December, 2023;
originally announced December 2023.
-
Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study
Authors:
Maike Züfle,
Verna Dankers,
Ivan Titov
Abstract:
With the ever-growing presence of social media platforms comes the increased spread of harmful content and the need for robust hate speech detection systems. Such systems easily overfit to specific targets and keywords, and evaluating them without considering distribution shifts that might occur between train and test data overestimates their benefit. We challenge hate speech models via new train-test splits of existing datasets that rely on the clustering of models' hidden representations. We present two split variants (Subset-Sum-Split and Closest-Split) that, when applied to two datasets using four pretrained models, reveal how models catastrophically fail on blind spots in the latent space. This result generalises when developing a split with one model and evaluating it on another. Our analysis suggests that there is no clear surface-level property of the data split that correlates with the decreased performance, which underscores that task difficulty is not always humanly interpretable. We recommend incorporating latent feature-based splits in model development and release two splits via the GenBench benchmark.
Submitted 16 November, 2023;
originally announced November 2023.
-
Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation
Authors:
Verna Dankers,
Ivan Titov,
Dieuwke Hupkes
Abstract:
When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint's position on that spectrum, and how does that spectrum influence neural models' performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints' surface-level characteristics and a model's per-datum training signals are predictive of memorisation in NMT, and (3) describe the influence that subsets of that map have on NMT systems' performance.
Submitted 9 November, 2023;
originally announced November 2023.
-
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training
Authors:
Max Müller-Eberstein,
Rob van der Goot,
Barbara Plank,
Ivan Titov
Abstract:
Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP); however, there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information-theoretic probing suite, which enables direct comparisons of not just task performance, but their representational subspaces, we analyze nine tasks covering syntax, semantics and reasoning, across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.
Submitted 25 October, 2023;
originally announced October 2023.
-
Cross-Modal Conceptualization in Bottleneck Models
Authors:
Danis Alukaev,
Semen Kiselev,
Ilya Pershin,
Bulat Ibragimov,
Vladimir Ivanov,
Alexey Kornaev,
Ivan Titov
Abstract:
Concept Bottleneck Models (CBMs) assume that training examples (e.g., x-ray images) are annotated with high-level concepts (e.g., types of abnormalities), and perform classification by first predicting the concepts, followed by predicting the label relying on these concepts. The main difficulty in using CBMs comes from having to choose concepts that are predictive of the label and then having to label training examples with these concepts. In our approach, we adopt a more moderate assumption and instead use text descriptions (e.g., radiology reports), accompanying the images in training, to guide the induction of concepts. Our cross-modal approach treats concepts as discrete latent variables and promotes concepts that (1) are predictive of the label, and (2) can be predicted reliably from both the image and text. Through experiments conducted on datasets ranging from synthetic datasets (e.g., synthetic images with generated descriptions) to realistic medical imaging datasets, we demonstrate that cross-modal learning encourages the induction of interpretable concepts while also facilitating disentanglement. Our results also suggest that this guidance leads to increased robustness by suppressing the reliance on shortcut features.
Submitted 17 December, 2023; v1 submitted 23 October, 2023;
originally announced October 2023.
-
On the Transferability of Visually Grounded PCFGs
Authors:
Yanpeng Zhao,
Ivan Titov
Abstract:
There has been a significant surge of interest in visually grounded grammar induction in recent times. While a variety of models have been developed for the task and have demonstrated impressive performance, they have not been evaluated on text domains that are different from the training domain, so it is unclear if the improvements brought by visual groundings are transferable. Our study aims to fill this gap and assess the degree of transferability. We start by extending VC-PCFG (short for Visually-grounded Compound PCFG; Zhao and Titov, 2020) in such a way that it can transfer across text domains. We consider a zero-shot transfer learning setting where a model is trained on the source domain and is directly applied to target domains, without any further training. Our experimental results suggest that the benefits from using visual groundings transfer to text in a domain similar to the training domain but fail to transfer to remote domains. Further, we conduct data and result analysis; we find that the lexicon overlap between the source domain and the target domain is the most important factor in the transferability of VC-PCFG.
Submitted 21 October, 2023;
originally announced October 2023.
-
Cache & Distil: Optimising API Calls to Large Language Models
Authors:
Guillem Ramírez,
Matthias Lindemann,
Alexandra Birch,
Ivan Titov
Abstract:
Large-scale deployment of generative AI tools often depends on costly API calls to a Large Language Model (LLM) to fulfil user queries. To curtail the frequency of these calls, one can employ a smaller language model -- a student -- which is continuously trained on the responses of the LLM. This student gradually gains proficiency in independently handling an increasing number of user requests, a process we term neural caching. The crucial element in neural caching is a policy that decides which requests should be processed by the student alone and which should be redirected to the LLM, subsequently aiding the student's learning. In this study, we focus on classification tasks, and we consider a range of classic active learning-based selection criteria as the policy. Our experiments suggest that Margin Sampling and Query by Committee bring consistent benefits across tasks and budgets.
Submitted 20 October, 2023;
originally announced October 2023.
-
Injecting a Structural Inductive Bias into a Seq2Seq Model by Simulation
Authors:
Matthias Lindemann,
Alexander Koller,
Ivan Titov
Abstract:
Strong inductive biases enable learning from little data and help generalization outside of the training distribution. Popular neural architectures such as Transformers lack strong structural inductive biases for seq2seq NLP tasks on their own. Consequently, they struggle with systematic generalization beyond the training distribution, e.g. with extrapolating to longer inputs, even when pre-trained on large amounts of text. We show how a structural inductive bias can be efficiently injected into a seq2seq model by pre-training it to simulate structural transformations on synthetic data. Specifically, we inject an inductive bias towards Finite State Transducers (FSTs) into a Transformer by pre-training it to simulate FSTs given their descriptions. Our experiments show that our method imparts the desired inductive bias, resulting in improved systematic generalization and better few-shot learning for FST-like tasks. Our analysis shows that fine-tuned models accurately capture the state dynamics of the unseen underlying FSTs, suggesting that the simulation process is internalized by the fine-tuned model.
Submitted 16 February, 2024; v1 submitted 1 October, 2023;
originally announced October 2023.
-
Polarization of recoil photon in non-linear Compton process
Authors:
A. I. Titov
Abstract:
The polarization of recoil photon ($γ'$) in the non-linear Compton process $e + \vec L \to \vec γ' +e'$ in the interaction of a relativistic electron with a linearly polarized laser beam ($\vec L$) is studied within the Furry picture in the lowest-order, tree-level S matrix element. In particular, we consider the asymmetry of differential cross sections ${\cal A}$ for two independent axes describing the Compton process equal to the intrinsic spin variable $ξ^f_3$, that determines the polarization properties of $γ'$. The sign and absolute value of the asymmetry determine the direction and degree of $γ'$ polarization. We have analyzed the process in a wide range of laser intensity that covers existing and future experiments. Our results provide additional knowledge for studying nonlinear multi-photon effects in quantum electrodynamics and can be used in planning experiments at envisaged laser facilities.
Submitted 26 March, 2024; v1 submitted 2 July, 2023;
originally announced July 2023.
-
Autoencoding Conditional Neural Processes for Representation Learning
Authors:
Victor Prokhorov,
Ivan Titov,
N. Siddharth
Abstract:
Conditional neural processes (CNPs) are a flexible and efficient family of models that learn to learn a stochastic process from data. They have seen particular application in contextual image completion - observing pixel values at some locations to predict a distribution over values at other unobserved locations. However, the choice of pixels in learning CNPs is typically either random or derived from a simple statistical measure (e.g. pixel variance). Here, we turn the problem on its head and ask: which pixels would a CNP like to observe - do they facilitate fitting better CNPs, and do such pixels tell us something meaningful about the underlying image? To this end we develop the Partial Pixel Space Variational Autoencoder (PPS-VAE), an amortised variational framework that casts CNP context as latent variables learnt simultaneously with the CNP. We evaluate PPS-VAE over a number of tasks across different visual data, and find that not only can it facilitate better-fit CNPs, but also that the spatial arrangement and values meaningfully characterise image information - evaluated through the lens of classification on both within and out-of-data distributions. Our model additionally allows for dynamic adaption of context-set size and the ability to scale-up to larger images, providing a promising avenue to explore learning meaningful and effective visual representations.
Submitted 17 February, 2024; v1 submitted 29 May, 2023;
originally announced May 2023.
-
Theoretical and Practical Perspectives on what Influence Functions Do
Authors:
Andrea Schioppa,
Katja Filippova,
Ivan Titov,
Polina Zablotskaia
Abstract:
Influence functions (IF) have been seen as a technique for explaining model predictions through the lens of the training data. Their utility is assumed to be in identifying training examples "responsible" for a prediction so that, for example, correcting a prediction is possible by intervening on those examples (removing or editing them) and retraining the model. However, recent empirical studies have shown that the existing methods of estimating IF predict the leave-one-out-and-retrain effect poorly.
In order to understand the mismatch between the theoretical promise and the practical results, we analyse five assumptions made by IF methods which are problematic for modern-scale deep neural networks and which concern convexity, numeric stability, training trajectory and parameter divergence. This allows us to clarify what can be expected theoretically from IF. We show that while most assumptions can be addressed successfully, the parameter divergence poses a clear limitation on the predictive power of IF: influence fades over training time even with deterministic training. We illustrate this theoretical result with BERT and ResNet models.
Another conclusion from the theoretical analysis is that IF are still useful for model debugging and correcting even though some of the assumptions made in prior work do not hold: using natural language processing and computer vision tasks, we verify that mis-predictions can be successfully corrected by taking only a few fine-tuning steps on influential examples.
Submitted 26 May, 2023;
originally announced May 2023.
-
Compositional Generalization without Trees using Multiset Tagging and Latent Permutations
Authors:
Matthias Lindemann,
Alexander Koller,
Ivan Titov
Abstract:
Seq2seq models have been shown to struggle with compositional generalization in semantic parsing, i.e. generalizing to unseen compositions of phenomena that the model handles correctly in isolation.
We phrase semantic parsing as a two-step process: we first tag each input token with a multiset of output tokens. Then we arrange the tokens into an output sequence using a new way of parameterizing and predicting permutations. We formulate predicting a permutation as solving a regularized linear program and we backpropagate through the solver. In contrast to prior work, our approach does not place a priori restrictions on possible permutations, making it very expressive.
Our model outperforms pretrained seq2seq models and prior work on realistic semantic parsing tasks that require generalization to longer examples. We also outperform non-tree-based models on structural generalization on the COGS benchmark. For the first time, we show that a model without an inductive bias provided by trees achieves high accuracy on generalization to deeper recursion.
Submitted 26 May, 2023;
originally announced May 2023.
-
Fingerprint of vortex-like flux closure in isotropic Nd-Fe-B bulk magnet
Authors:
Mathias Bersweiler,
Yojiro Oba,
Evelyn Pratami Sinaga,
Inma Peral,
Ivan Titov,
Michael P. Adams,
Konstantin L. Metlov,
Andreas Michels
Abstract:
Taking advantage of recent progress in neutron instrumentation and in the understanding of magnetic-field-dependent small-angle neutron scattering, here, we study the three-dimensional magnetization distribution within an isotropic Nd-Fe-B bulk magnet. The magnetic neutron scattering cross section of this system features the so-called spike anisotropy, which points towards the presence of a strong magnetodipolar interaction. This experimental result combined with a damped oscillatory behavior of the corresponding correlation function and recent micromagnetic simulation results on spherical nanoparticles suggest an interpretation of the neutron data in terms of vortex-like flux-closure patterns. The field-dependent correlation length Lc is well reproduced by a phenomenological power-law model. While the experimental neutron data for Lc are described by an exponent close to unity (p = 0.86), the simulation results yield p = 1.70, posing a challenge to theory to include vortex-vortex interaction effects.
Submitted 17 October, 2023; v1 submitted 27 March, 2023;
originally announced March 2023.
-
Recursive Neural Networks with Bottlenecks Diagnose (Non-)Compositionality
Authors:
Verna Dankers,
Ivan Titov
Abstract:
A recent line of work in NLP focuses on the (dis)ability of models to generalise compositionally for artificial languages. However, when considering natural language tasks, the data involved is not strictly, or locally, compositional. Quantifying the compositionality of data is a challenging task, which has been investigated primarily for short utterances. We use recursive neural models (Tree-LSTMs) with bottlenecks that limit the transfer of information between nodes. We illustrate that comparing data's representations in models with and without the bottleneck can be used to produce a compositionality metric. The procedure is applied to the evaluation of arithmetic expressions using synthetic data, and sentiment classification using natural language data. We demonstrate that compression through a bottleneck impacts non-compositional examples disproportionately and then use the bottleneck compositionality metric (BCM) to distinguish compositional from non-compositional samples, yielding a compositionality ranking over a dataset.
Submitted 31 January, 2023;
originally announced January 2023.
-
Hierarchical Phrase-based Sequence-to-Sequence Learning
Authors:
Bailin Wang,
Ivan Titov,
Jacob Andreas,
Yoon Kim
Abstract:
We describe a neural transducer that maintains the flexibility of standard sequence-to-sequence (seq2seq) models while incorporating hierarchical phrases as a source of inductive bias during training and as explicit constraints during inference. Our approach trains two models: a discriminative parser based on a bracketing transduction grammar whose derivation tree hierarchically aligns source and target phrases, and a neural seq2seq model that learns to translate the aligned phrases one-by-one. We use the same seq2seq model to translate at all phrase scales, which results in two inference modes: one mode in which the parser is discarded and only the seq2seq component is used at the sequence-level, and another in which the parser is combined with the seq2seq model. Decoding in the latter mode is done with the cube-pruned CKY algorithm, which is more involved but can make use of new translation rules during inference. We formalize our model as a source-conditioned synchronous grammar and develop an efficient variational inference algorithm for training. When applied on top of both randomly initialized and pretrained seq2seq models, we find that both inference modes perform well compared to baselines on small-scale machine translation benchmarks.
Submitted 15 November, 2022; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Compositional Generalisation with Structured Reordering and Fertility Layers
Authors:
Matthias Lindemann,
Alexander Koller,
Ivan Titov
Abstract:
Seq2seq models have been shown to struggle with compositional generalisation, i.e. generalising to new and potentially more complex structures than seen during training. Taking inspiration from grammar-based models that excel at compositional generalisation, we present a flexible end-to-end differentiable neural model that composes two structural operations: a fertility step, which we introduce in this work, and a reordering step based on previous work (Wang et al., 2021). To ensure differentiability, we use the expected value of each step. Our model outperforms seq2seq models by a wide margin on challenging compositional splits of realistic semantic parsing tasks that require generalisation to longer examples. It also compares favourably to other models targeting compositional generalisation.
Submitted 15 February, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation
Authors:
Verna Dankers,
Christopher G. Lucas,
Ivan Titov
Abstract:
Unlike literal expressions, idioms' meanings do not directly follow from their parts, posing a challenge for neural machine translation (NMT). NMT models are often unable to translate idioms accurately and over-generate compositional, literal translations. In this work, we investigate whether the non-compositionality of idioms is reflected in the mechanics of the dominant NMT model, Transformer, by analysing the hidden states and attention patterns for models with English as source language and one of seven European languages as target language. When Transformer emits a non-literal translation - i.e. identifies the expression as idiomatic - the encoder processes idioms more strongly as single lexical units compared to literal expressions. This manifests in idioms' parts being grouped through attention and in reduced interaction between idioms and their context. In the decoder's cross-attention, figurative inputs result in reduced attention on source-side tokens. These results suggest that Transformer's tendency to process idioms as compositional expressions contributes to literal translations of idioms.
Submitted 30 May, 2022;
originally announced May 2022.
-
Uniaxial polarization analysis of bulk ferromagnets: Theory and first experimental results
Authors:
A. Malyeyev,
I. Titov,
C. D. Dewhurst,
K. Suzuki,
D. Honecker,
A. Michels
Abstract:
Based on Brown's static equations of micromagnetics, we compute the uniaxial polarization of the scattered neutron beam of a bulk magnetic material. The theoretical expressions are compared to experimental data on a soft magnetic nanocrystalline alloy. The micromagnetic SANS theory provides a general framework for polarized real-space neutron methods, and it opens up a new avenue for magnetic neutron data analysis on magnetic microstructures.
Submitted 18 January, 2022;
originally announced January 2022.
-
Sparse Interventions in Language Models with Differentiable Masking
Authors:
Nicola De Cao,
Leon Schmid,
Dieuwke Hupkes,
Ivan Titov
Abstract:
There has been a lot of interest in understanding what information is captured by hidden representations of language models (LMs). Typically, interpretation methods i) do not guarantee that the model actually uses the encoded information, and ii) do not discover small subsets of neurons responsible for a considered phenomenon. Inspired by causal mediation analysis, we propose a method that discovers within a neural LM a small subset of neurons responsible for a particular linguistic phenomenon, i.e., subsets causing a change in the corresponding token emission probabilities. We use a differentiable relaxation to approximately search through the combinatorial space. An $L_0$ regularization term ensures that the search converges to discrete and sparse solutions. We apply our method to analyze subject-verb number agreement and gender bias detection in LSTMs. We observe that it is fast and finds better solutions than the alternative (REINFORCE). Our experiments confirm that each of these phenomena is mediated through a small subset of neurons that do not play any other discernible role.
Submitted 13 December, 2021;
originally announced December 2021.
-
Learning Opinion Summarizers by Selecting Informative Reviews
Authors:
Arthur Bražinskas,
Mirella Lapata,
Ivan Titov
Abstract:
Opinion summarization has been traditionally approached with unsupervised, weakly-supervised and few-shot learning techniques. In this work, we collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training. However, the number of reviews per product is large (320 on average), making summarization - and especially training a summarizer - impractical. Moreover, the content of many reviews is not reflected in the human-written summaries, and, thus, the summarizer trained on random review subsets hallucinates. In order to deal with both of these challenges, we formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets. The choice of the review subset is treated as a latent variable, predicted by a small and simple selector. The subset is then fed into a more powerful summarizer. For joint training, we use amortized variational inference and policy gradient methods. Our experiments demonstrate the importance of selecting informative reviews resulting in improved quality of summaries and reduced hallucinations.
Submitted 9 September, 2021;
originally announced September 2021.
-
Highly Parallel Autoregressive Entity Linking with Discriminative Correction
Authors:
Nicola De Cao,
Wilker Aziz,
Ivan Titov
Abstract:
Generative approaches have been recently shown to be effective for both Entity Disambiguation and Entity Linking (i.e., joint mention detection and disambiguation). However, the previously proposed autoregressive formulation for EL suffers from i) high computational cost due to a complex (deep) decoder, ii) non-parallelizable decoding that scales with the source sequence length, and iii) the need for training on a large amount of data. In this work, we propose a very efficient approach that parallelizes autoregressive linking across all potential mentions and relies on a shallow and efficient decoder. Moreover, we augment the generative objective with an extra discriminative component, i.e., a correction term which lets us directly optimize the generator's ranking. When taken together, these techniques tackle all the above issues: our model is >70 times faster and more accurate than the previous generative method, outperforming state-of-the-art approaches on the standard English dataset AIDA-CoNLL. Source code available at https://github.com/nicola-decao/efficient-autoregressive-EL
Submitted 8 September, 2021;
originally announced September 2021.
-
Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT
Authors:
Elena Voita,
Rico Sennrich,
Ivan Titov
Abstract:
Unlike traditional statistical MT, which decomposes the translation task into distinct, separately learned components, neural machine translation uses a single neural network to model the entire translation process. Despite neural machine translation being the de-facto standard, it is still not clear how NMT models acquire different competences over the course of training, and how this mirrors the different models in traditional SMT. In this work, we look at the competences related to three core SMT components and find that during training, NMT first focuses on learning target-side language modeling, then improves translation quality approaching word-by-word translation, and finally learns more complicated reordering patterns. We show that this behavior holds for several models and language pairs. Additionally, we explain how such an understanding of the training process can be useful in practice and, as an example, show how it can be used to improve vanilla non-autoregressive neural machine translation by guiding teacher model selection.
Submitted 3 September, 2021;
originally announced September 2021.
-
Positron energy distribution in factorized trident process
Authors:
A. I. Titov,
U. Hernandez Acosta,
B. Kampfer
Abstract:
We estimate the energy distribution of positrons produced in the interaction of ultra-relativistic electrons with a high-intensity laser beam. The underlying trident process is factorized on the probabilistic level. That is, we deploy a two-step mechanism for the formation of electron-positron pairs. In the first step, a high-energy photon is produced as a result of nonlinear Compton scattering. In the second step, an electron-positron pair is created by the nonlinear (multi-photon) Breit-Wheeler process.
Submitted 29 December, 2021; v1 submitted 30 August, 2021;
originally announced August 2021.
-
Exploring Unsupervised Pretraining Objectives for Machine Translation
Authors:
Christos Baziotis,
Ivan Titov,
Alexandra Birch,
Barry Haddow
Abstract:
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT), by drastically reducing the need for large parallel data. Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder. In this work, we systematically compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context. We pretrain models with different methods on English$\leftrightarrow$German, English$\leftrightarrow$Nepali and English$\leftrightarrow$Sinhala monolingual data, and evaluate them on NMT. In (semi-) supervised NMT, varying the pretraining objective leads to surprisingly small differences in the finetuned performance, whereas unsupervised NMT is much more sensitive to it. To understand these results, we thoroughly study the pretrained models using a series of probes and verify that they encode and use information in different ways. We conclude that finetuning on parallel data is mostly sensitive to few properties that are shared by most models, such as a strong decoder, in contrast to unsupervised NMT that also requires models with strong cross-lingual abilities.
Submitted 10 June, 2021;
originally announced June 2021.
-
Meta-Learning to Compositionally Generalize
Authors:
Henry Conklin,
Bailin Wang,
Kenny Smith,
Ivan Titov
Abstract:
Natural language is compositional; the meaning of a sentence is a function of the meaning of its parts. This property allows humans to create and interpret novel sentences, generalizing robustly outside their prior experience. Neural networks have been shown to struggle with this kind of generalization, in particular performing poorly on tasks designed to assess compositional generalization (i.e. where training and testing distributions differ in ways that would be trivial for a compositional strategy to resolve). Their poor performance on these tasks may in part be due to the nature of supervised learning which assumes training and testing data to be drawn from the same distribution. We implement a meta-learning augmented version of supervised learning whose objective directly optimizes for out-of-distribution generalization. We construct pairs of tasks for meta-learning by sub-sampling existing training data. Each pair of tasks is constructed to contain relevant examples, as determined by a similarity metric, in an effort to inhibit models from memorizing their input. Experimental results on the COGS and SCAN datasets show that our similarity-driven meta-learning can improve generalization performance.
Submitted 29 June, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Structured Reordering for Modeling Latent Alignments in Sequence Transduction
Authors:
Bailin Wang,
Mirella Lapata,
Ivan Titov
Abstract:
Despite success in many domains, neural models struggle in settings where train and test examples are drawn from different distributions. In particular, in contrast to humans, conventional sequence-to-sequence (seq2seq) models fail to generalize systematically, i.e., interpret sentences representing novel combinations of concepts (e.g., text segments) seen in training. Traditional grammar formalisms excel in such settings by implicitly encoding alignments between input and output segments, but are hard to scale and maintain. Instead of engineering a grammar, we directly model segment-to-segment alignments as discrete structured latent variables within a neural seq2seq model. To efficiently explore the large space of alignments, we introduce a reorder-first align-later framework whose central component is a neural reordering module producing separable permutations. We present an efficient dynamic programming algorithm performing exact marginal inference of separable permutations, and, thus, enabling end-to-end differentiable training of our model. The resulting seq2seq model exhibits better systematic generalization than standard models on synthetic problems and NLP tasks (i.e., semantic parsing and machine translation).
Submitted 26 October, 2021; v1 submitted 6 June, 2021;
originally announced June 2021.
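Illustrative note on the entry above: a minimal sketch of what "separable permutation" means, assuming the term is used in its standard combinatorial sense (a permutation that contains neither the pattern 2413 nor 3142). The function names are hypothetical, and the brute-force check below is not the paper's dynamic programming algorithm; it only shows which reorderings fall inside the class.

```python
from itertools import combinations, permutations

# The two length-4 patterns a separable permutation must avoid.
FORBIDDEN = ((2, 4, 1, 3), (3, 1, 4, 2))

def pattern_of(values):
    """Return the relative order of distinct `values` as ranks starting at 1."""
    ranks = {v: r for r, v in enumerate(sorted(values), start=1)}
    return tuple(ranks[v] for v in values)

def is_separable(perm):
    """Brute-force test: no 4-element subsequence forms a forbidden pattern."""
    return all(
        pattern_of([perm[i] for i in idx]) not in FORBIDDEN
        for idx in combinations(range(len(perm)), 4)
    )

# Count separable permutations of length 4 (24 permutations minus the 2 patterns).
count_4 = sum(is_separable(p) for p in permutations(range(1, 5)))
print(is_separable((2, 1, 4, 3)), is_separable((2, 4, 1, 3)), count_4)
```

The check enumerates every 4-element subsequence, so it is only practical for short sequences; it is meant as a definition in code rather than an efficient procedure.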
-
Rise and fall of laser-intensity effects in spectrally resolved Compton process
Authors:
Uwe Hernandez Acosta,
Alexander I. Titov,
Burkhard Kämpfer
Abstract:
The spectrally resolved differential cross section of Compton scattering, $d σ/ d ω' \vert_{ω' = const}$, rises from small towards larger laser intensity parameter $ξ$, reaches a maximum, and falls towards the asymptotic strong-field region. Expressed by invariant quantities: $d σ/du \vert_{u = const}$ rises from small towards larger values of $ξ$, reaches a maximum at $ξ_{max} = \frac49 {\cal K} u m^2 / k \cdot p$, ${\cal K} = {\cal O} (1)$, and falls at $ξ> ξ_{max}$ like $\propto ξ^{-3/2} \exp \left (- \frac{2 u m^2}{3 ξ\, k \cdot p} \right )$ at $u \ge 1$. [The quantity $u$ is the Ritus variable related to the light-front momentum-fraction $s = (1 + u)/u = k \cdot k' / k \cdot p$ of the emitted photon (four-momentum $k'$, frequency $ω'$), and $k \cdot p/m^2$ quantifies the invariant energy in the entrance channel of electron (four-momentum $p$, mass $m$) and laser (four-wave vector $k$).] Such a behavior of a differential observable is to be contrasted with the laser intensity dependence of the total probability, $\lim_{χ= ξk \cdot p/m^2, ξ\to \infty} \mathbb{P} \propto αχ^{2/3} m^2 / k \cdot p$, which is governed by the soft spectral part.
We combine the hard-photon yield from Compton with the seeded Breit-Wheeler pair production in a folding model and obtain a rapidly increasing $e^+ e^-$ pair number at $ξ\lesssim 4$. Laser bandwidth effects are quantified in the weak-field limit of the related trident pair production.
Submitted 25 May, 2021;
originally announced May 2021.
-
Role of higher-order effects in spin-misalignment small-angle neutron scattering of high-pressure torsion nickel
Authors:
Yojiro Oba,
Mathias Bersweiler,
Ivan Titov,
Nozomu Adachi,
Yoshikazu Todaka,
Elliot Paul Gilbert,
Nina-Juliane Steinke,
Konstantin L. Metlov,
Andreas Michels
Abstract:
Magnetic-field-dependent unpolarized small-angle neutron scattering (SANS) experiments demonstrate that high-pressure torsion (HPT) straining induces spin misalignments in pure Ni, which persist in magnetic fields up to 4 T. The spin-misalignment scattering patterns are elongated perpendicular to the applied magnetic field due to an unusual predominant longitudinal $\sin^2(θ)$-type angular anisotropy. Such a contribution cannot be explained by the conventional second-order (in spin-misalignment amplitude) micromagnetic SANS theory in the approach-to-saturation regime, nor can its magnitude relative to the other features of the cross sections be explained by the third-order micromagnetic SANS theory. This indicates that the high density of crystal defects induced via HPT straining in Ni makes such higher-order effects in the micromagnetic SANS cross sections observable.
Submitted 10 May, 2021;
originally announced May 2021.
-
Editing Factual Knowledge in Language Models
Authors:
Nicola De Cao,
Wilker Aziz,
Ivan Titov
Abstract:
The factual knowledge acquired during pre-training and stored in the parameters of Language Models (LMs) can be useful in downstream tasks (e.g., question answering or textual inference). However, some facts can be incorrectly induced or become obsolete over time. We present KnowledgeEditor, a method which can be used to edit this knowledge and, thus, fix 'bugs' or unexpected predictions without the need for expensive re-training or fine-tuning. Besides being computationally efficient, KnowledgeEditor does not require any modifications in LM pre-training (e.g., the use of meta-learning). In our approach, we train a hyper-network with constrained optimization to modify a fact without affecting the rest of the knowledge; the trained hyper-network is then used to predict the weight update at test time. We show KnowledgeEditor's efficacy with two popular architectures and knowledge-intensive tasks: i) a BERT model fine-tuned for fact-checking, and ii) a sequence-to-sequence BART model for question answering. With our method, changing a prediction on the specific wording of a query tends to result in a consistent change in predictions also for its paraphrases. We show that this can be further encouraged by exploiting (e.g., automatically-generated) paraphrases during training. Interestingly, our hyper-network can be regarded as a 'probe' revealing which components need to be changed to manipulate factual knowledge; our analysis shows that the updates tend to be concentrated on a small subset of components. Source code available at https://github.com/nicola-decao/KnowledgeEditor
Submitted 8 September, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
Sparse Attention with Linear Units
Authors:
Biao Zhang,
Ivan Titov,
Rico Sennrich
Abstract:
Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
Submitted 6 October, 2021; v1 submitted 14 April, 2021;
originally announced April 2021.
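Illustrative note on the entry above: a minimal sketch of the core idea stated in the abstract, replacing the softmax in scaled dot-product attention with a ReLU. The toy keys and queries and all function names are assumptions made for illustration, and the layer normalization, specialized initialization, and gating that the abstract mentions for training stability are deliberately left out. It only contrasts the two weighting functions on hand-picked scores.

```python
import math

def scaled_scores(query, keys):
    """Scaled dot-product scores of one query against every key."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

def softmax_weights(scores):
    """Standard softmax: every weight is strictly positive and they sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_weights(scores):
    """ReLU in place of softmax: negative scores become exactly zero."""
    return [max(0.0, s) for s in scores]

# Hypothetical toy data: three keys and two queries.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_on = [2.0, -1.0]    # positive score for some keys only
query_off = [-1.0, -1.0]  # negative score for every key

on_scores = scaled_scores(query_on, keys)
off_scores = scaled_scores(query_off, keys)

print(relu_weights(on_scores))      # some entries are exactly zero (sparse)
print(relu_weights(off_scores))     # all zero: the query attends to nothing
print(softmax_weights(off_scores))  # softmax still spreads mass over all keys
```

With `query_off`, every score is negative, so the ReLU weights are all zero, whereas softmax must still distribute a total weight of one across the keys.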
-
Learning from Executions for Semantic Parsing
Authors:
Bailin Wang,
Mirella Lapata,
Ivan Titov
Abstract:
Semantic parsing aims at translating natural language (NL) utterances into machine-interpretable programs, which can be executed against a real-world environment. The expensive annotation of utterance-program pairs has long been acknowledged as a major bottleneck for the deployment of contemporary neural models to real-life applications. In this work, we focus on the task of semi-supervised learning where a limited amount of annotated data is available together with many unlabeled NL utterances. Based on the observation that programs which correspond to NL utterances must always be executable, we propose to encourage a parser to generate executable programs for unlabeled utterances. Due to the large search space of executable programs, conventional methods that use approximations based on beam search, such as self-training and top-k marginal likelihood training, do not perform as well. Instead, we view the problem of learning from executions from the perspective of posterior regularization and propose a set of new training objectives. Experimental results on Overnight and GeoQuery show that our new objectives outperform conventional methods, bridging the gap between semi-supervised and supervised learning.
Submitted 12 April, 2021;
originally announced April 2021.
-
An Empirical Study of Compound PCFGs
Authors:
Yanpeng Zhao,
Ivan Titov
Abstract:
Compound probabilistic context-free grammars (C-PCFGs) have recently established a new state of the art for unsupervised phrase-structure grammar induction. However, due to the high space and time complexities of chart-based representation and inference, it is difficult to investigate C-PCFGs comprehensively. In this work, we rely on a fast implementation of C-PCFGs to conduct an evaluation complementary to that of Kim et al. (2019). We start by analyzing and ablating C-PCFGs on English treebanks. Our findings suggest that (1) C-PCFGs are data-efficient and can generalize to unseen sentence/constituent lengths; and (2) C-PCFGs make the best use of sentence-level information in generating preterminal rule probabilities. We further conduct a multilingual evaluation of C-PCFGs. The experimental results show that the best configurations of C-PCFGs, which are tuned on English, do not always generalize to morphology-rich languages.
Submitted 21 October, 2023; v1 submitted 3 March, 2021;
originally announced March 2021.
-
Impact of laser polarization on q-exponential photon tails in non-linear Compton scattering
Authors:
B. Kampfer,
A. I. Titov
Abstract:
Non-linear Compton scattering of ultra-relativistic electrons traversing high-intensity laser pulses also generates hard photons. These photon high-energy tails are considered for parameters within reach of the forthcoming experiments LUXE and E-320. We consider the invariant differential cross sections $d σ/ du$ between the IR and UV regions, analyze the impact of the laser polarization, and find q-deformed exponential shapes. (The variable $u$ is the light-cone momentum-transfer from initial electron to final photon.) Optical laser pulses of various durations are compared with the monochromatic laser beam model which uncovers the laser intensity parameter in the range $ξ= 1 \cdots 10$. Some supplementary information is provided for the azimuthal final-electron/photon distributions and the photon energy-differential cross sections.
Submitted 18 February, 2021; v1 submitted 14 December, 2020;
originally announced December 2020.
-
Neutron study of magnetic correlations in rare-earth-free Mn-Bi magnets
Authors:
Artem Malyeyev,
Ivan Titov,
Philipp Bender,
Mathias Bersweiler,
Vitaliy Pipich,
Sebastian Mühlbauer,
Semih Ener,
Oliver Gutfleisch,
Andreas Michels
Abstract:
We report the results of an unpolarized small-angle neutron scattering (SANS) study on Mn-Bi-based rare-earth-free permanent magnets. The magnetic SANS cross section is dominated by long-wavelength transversal magnetization fluctuations and has been analyzed in terms of the Guinier-Porod model and the distance distribution function. This provides the radius of gyration which, in the remanent state, ranges from about $220$ to $240 \, \mathrm{nm}$ for the three different alloy compositions investigated. Moreover, computation of the distance distribution function in conjunction with results for the so-called $s$-parameter obtained from the Guinier-Porod model indicates that the magnetic scattering of a Mn$_{45}$Bi$_{55}$ sample has its origin in slightly shape-anisotropic structures.
Submitted 26 February, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks
Authors:
Denis Emelin,
Ivan Titov,
Rico Sennrich
Abstract:
Word sense disambiguation is a well-known source of translation errors in NMT. We posit that some of the incorrect disambiguation choices are due to models' over-reliance on dataset artifacts found in training data, specifically superficial word co-occurrences, rather than a deeper understanding of the source text. We introduce a method for the prediction of disambiguation errors based on statistical data properties, demonstrating its effectiveness across several domains and model types. Moreover, we develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors to further probe the robustness of translation models. Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.
Submitted 3 November, 2020;
originally announced November 2020.
-
Fast Interleaved Bidirectional Sequence Generation
Authors:
Biao Zhang,
Ivan Titov,
Rico Sennrich
Abstract:
Independence assumptions during sequence generation can speed up inference, but parallel generation of highly inter-dependent tokens comes at a cost in quality. Instead of assuming independence between neighbouring tokens (semi-autoregressive decoding, SA), we take inspiration from bidirectional sequence generation and introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously. We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder by simply interleaving the two directions and adapting the word positions and self-attention masks. Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer, and on five machine translation tasks and two document summarization tasks, achieves a decoding speedup of ~2X compared to autoregressive decoding with comparable quality. Notably, it outperforms left-to-right SA because the independence assumptions in IBDecoder are more felicitous. To achieve even higher speedups, we explore hybrid models where we either simultaneously predict multiple neighbouring tokens per direction, or perform multi-directional decoding by partitioning the target sequence. These methods achieve speedups of 4X-11X across different tasks at the cost of <1 BLEU or <0.5 ROUGE (on average). Source code is released at https://github.com/bzhangGo/zero.
Submitted 27 October, 2020;
originally announced October 2020.
-
A Differentiable Relaxation of Graph Segmentation and Alignment for AMR Parsing
Authors:
Chunchuan Lyu,
Shay B. Cohen,
Ivan Titov
Abstract:
Abstract Meaning Representations (AMR) are a broad-coverage semantic formalism which represents sentence meaning as a directed acyclic graph. To train most AMR parsers, one needs to segment the graph into subgraphs and align each such subgraph to a word in a sentence; this is normally done at preprocessing, relying on hand-crafted rules. In contrast, we treat both alignment and segmentation as latent variables in our model and induce them as part of end-to-end training. As marginalizing over the structured latent variables is infeasible, we use the variational autoencoding framework. To ensure end-to-end differentiable optimization, we introduce a differentiable relaxation of the segmentation and alignment problems. We observe that inducing segmentation yields substantial gains over using a 'greedy' segmentation heuristic. The performance of our method also approaches that of a model that relies on the segmentation rules of Lyu and Titov (2018), which were hand-crafted to handle individual AMR constructions.
Submitted 24 October, 2022; v1 submitted 23 October, 2020;
originally announced October 2020.
-
Meta-Learning for Domain Generalization in Semantic Parsing
Authors:
Bailin Wang,
Mirella Lapata,
Ivan Titov
Abstract:
The importance of building semantic parsers which can be applied to new domains and generate programs unseen at training has long been acknowledged, and datasets testing out-of-domain performance are becoming increasingly available. However, little or no attention has been devoted to learning algorithms or objectives which promote domain generalization, with virtually all existing approaches relying on standard supervised learning. In this work, we use a meta-learning framework which targets zero-shot domain generalization for semantic parsing. We apply a model-agnostic training algorithm that simulates zero-shot parsing by constructing virtual train and test sets from disjoint domains. The learning objective capitalizes on the intuition that gradient steps that improve source-domain performance should also improve target-domain performance, thus encouraging a parser to generalize to unseen target domains. Experimental results on the (English) Spider and Chinese Spider datasets show that the meta-learning objective significantly boosts the performance of a baseline parser.
Submitted 12 April, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation
Authors:
Elena Voita,
Rico Sennrich,
Ivan Titov
Abstract:
In Neural Machine Translation (and, more generally, conditional language modeling), the generation of a target token is influenced by two types of context: the source and the prefix of the target sequence. While many attempts to understand the internal workings of NMT models have been made, none of them explicitly evaluates relative source and target contributions to a generation decision. We argue that this relative contribution can be evaluated by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying 'conservation principle' makes relevance propagation unique: unlike other methods, it evaluates not an abstract quantity reflecting token importance, but the proportion of each token's influence. We extend LRP to the Transformer and conduct an analysis of NMT models which explicitly evaluates the source and target relative contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes, when varying the training objective or the amount of training data, and during the training process. We find that models trained with more data tend to rely on source information more and to have sharper token contributions; the training process is non-monotonic, with several stages of different nature.
Submitted 25 June, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Adaptive Feature Selection for End-to-End Speech Translation
Authors:
Biao Zhang,
Ivan Titov,
Barry Haddow,
Rico Sennrich
Abstract:
Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to SR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).
Submitted 20 October, 2020; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking
Authors:
Michael Sejr Schlichtkrull,
Nicola De Cao,
Ivan Titov
Abstract:
Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models. However, there has been little work on interpreting them, and specifically on understanding which parts of the graphs (e.g. syntactic trees or co-reference structures) contribute to a prediction. In this work, we introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges. Given a trained GNN model, we learn a simple classifier that, for every edge in every layer, predicts if that edge can be dropped. We demonstrate that such a classifier can be trained in a fully differentiable fashion, employing stochastic gates and encouraging sparsity through the expected $L_0$ norm. We use our technique as an attribution method to analyze GNN models for two tasks -- question answering and semantic role labeling -- providing insights into the information flow in these models. We show that we can drop a large proportion of edges without deteriorating the performance of the model, while we can analyse the remaining edges for interpreting model predictions.
Submitted 3 October, 2022; v1 submitted 1 October, 2020;
originally announced October 2020.
-
Visually Grounded Compound PCFGs
Authors:
Yanpeng Zhao,
Ivan Titov
Abstract:
Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via Reinforce and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituent types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and gradient estimates are noisy. We show that using an extension of the probabilistic context-free grammar model we can do fully-differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and, thus, confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with the largest improvements on more 'abstract' categories (e.g., +55.1% recall on VPs).
Submitted 25 September, 2020;
originally announced September 2020.
-
Non-linear Breit-Wheeler process with linearly polarized beams
Authors:
Alexander I. Titov,
Burkhard Kampfer
Abstract:
We study the non-linear Breit-Wheeler process $\vec γ' + \vec L \to e^+ + e^-$ in the interaction of linearly polarized probe photons ($\vec γ'$) with a linearly polarized laser beam ($\vec L$). In particular, we consider the asymmetry of the total cross section and the azimuthal electron distributions when the polarizations of the photon and laser beams in the initial state are mutually perpendicular or parallel. Considering intense laser beams and the strong field asymptotic we explore essentially the multi-photon dynamics. The asymmetry exhibits some non-monotonic behavior depending on initial kinematic conditions; it depends sensitively on the laser pulse duration. Our results provide additional knowledge for studying non-linear multi-photon effects in quantum electrodynamics and may be used in planning experiments in upcoming laser facilities.
Submitted 18 November, 2020; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Anisometric mesoscale nuclear and magnetic texture in sintered Nd-Fe-B magnets
Authors:
I. Titov,
D. Honecker,
D. Mettus,
A. Feoktystov,
J. Kohlbrecher,
P. Strunz,
A. Michels
Abstract:
By means of temperature and wavelength-dependent small-angle neutron scattering (SANS) experiments on sintered isotropic and textured Nd-Fe-B magnets we provide evidence for the existence of an anisometric structure in the microstructure of the textured magnets. This conclusion is reached by observing a characteristic cross-shaped angular anisotropy in the total unpolarized SANS cross section at temperatures well above the Curie temperature. Comparison of the experimental SANS data to a microstructural model based on the superquadrics form factor allows us to estimate the shape and lower bounds for the size of the structure. Subtraction of the scattering cross section in the paramagnetic regime from data taken at room temperature provides the magnetic SANS cross section. Surprisingly, the anisotropy of the magnetic scattering is very similar to the nuclear SANS signal, suggesting that the nuclear structure is decorated by the magnetic moments via spin-orbit coupling. Based on the computation of the two-dimensional correlation function we estimate lower bounds for the longitudinal and transversal magnetic correlation lengths.
Submitted 12 May, 2020;
originally announced May 2020.
-
Unsupervised Transfer of Semantic Role Models from Verbal to Nominal Domain
Authors:
Yanpeng Zhao,
Ivan Titov
Abstract:
Semantic role labeling (SRL) is an NLP task involving the assignment of predicate arguments to types, called semantic roles. Though research on SRL has primarily focused on verbal predicates and many resources available for SRL provide annotations only for verbs, semantic relations are often triggered by other linguistic constructions, e.g., nominalizations. In this work, we investigate a transfer scenario where we assume role-annotated data for the source verbal domain but only unlabeled data for the target nominal domain. Our key assumption, enabling the transfer between the two domains, is that selectional preferences of a role (i.e., preferences or constraints on the admissible arguments) do not strongly depend on whether the relation is triggered by a verb or a noun. For example, the same set of arguments can fill the Acquirer role for the verbal predicate 'acquire' and its nominal form 'acquisition'. We approach the transfer task from the variational autoencoding perspective. The labeler serves as an encoder (predicting role labels given a sentence), whereas selectional preferences are captured in the decoder component (generating arguments for the predicted roles). Nominal roles are not labeled in the training data, and the learning objective instead pushes the labeler to assign roles predictive of the arguments. Sharing the decoder parameters across the domains encourages consistency between labels predicted for both domains and facilitates the transfer. The method substantially outperforms baselines, such as unsupervised and 'direct transfer' methods, on the English CoNLL-2009 dataset.
Submitted 26 September, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.