
Code Pretraining Improves Entity Tracking Abilities of Language Models

Authors: Najoung Kim, Sebastian Schuster, Shubham Toshniwal

Abstract: Recent work has provided indirect evidence that pretraining language models on code improves the ability of models to track state changes of discourse entities expressed in natural language. In this work, we systematically test this claim by comparing pairs of language models on their entity tracking performance. Critically, the pairs consist of base models and models trained on top of these base models with additional code data. We extend this analysis to additionally examine the effect of math training, another highly structured data type, and alignment tuning, an important step for enhancing the usability of models. We find clear evidence that models additionally trained on large amounts of code outperform the base models. On the other hand, we find no consistent benefit of additional math training or alignment tuning across various model families.

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2404.04332 [pdf, other]

doi 10.1162/tacl_a_00670

Scope Ambiguities in Large Language Models

Authors: Gaurav Kamath, Sebastian Schuster, Sowmya Vajjala, Siva Reddy

Abstract: Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models -- GPT-2, GPT-3/3.5, Llama 2 and GPT-4 -- treat scope ambiguous sentences, and compare this with human judgments. We introduce novel datasets that contain a joint total of almost 1,000 unique scope-ambiguous sentences, containing interactions between a range of semantic operators, and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).

Submitted 5 April, 2024; originally announced April 2024.

Comments: To be published in Transactions of the Association for Computational Linguistics

arXiv:2305.02363 [pdf, other]

Entity Tracking in Language Models

Authors: Najoung Kim, Sebastian Schuster

Abstract: Keeping track of how states of entities change as a text or dialog unfolds is a key prerequisite to discourse understanding. Yet, there have been few systematic investigations into the ability of large language models (LLMs) to track discourse entities. In this work, we present a task probing to what extent a language model can infer the final state of an entity given an English description of the initial state and a series of state-changing operations. We use this task to first investigate whether Flan-T5, GPT-3 and GPT-3.5 can track the state of entities, and find that only GPT-3.5 models, which have been pretrained on large amounts of code, exhibit this ability. We then investigate whether smaller models pretrained primarily on text can learn to track entities, through finetuning T5 on several training/evaluation splits. While performance degrades for more complex splits, we find that even when evaluated on a different set of entities from training or longer operation sequences, a finetuned model can perform non-trivial entity tracking. Taken together, these results suggest that language models can learn to track entities but pretraining on text corpora alone does not make this capacity surface.

Submitted 8 September, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: ACL 2023 Camera-ready

arXiv:2304.04758 [pdf, other]

Expectations over Unspoken Alternatives Predict Pragmatic Inferences

Authors: Jennifer Hu, Roger Levy, Judith Degen, Sebastian Schuster

Abstract: Scalar inferences (SI) are a signature example of how humans interpret language based on unspoken alternatives. While empirical studies have demonstrated that human SI rates are highly variable -- both within instances of a single scale, and across different scales -- there have been few proposals that quantitatively explain both cross- and within-scale variation. Furthermore, while it is generally assumed that SIs arise through reasoning about unspoken alternatives, it remains debated whether humans reason about alternatives as linguistic forms, or at the level of concepts. Here, we test a shared mechanism explaining SI rates within and across scales: context-driven expectations about the unspoken alternatives. Using neural language models to approximate human predictive distributions, we find that SI rates are captured by the expectedness of the strong scalemate as an alternative. Crucially, however, expectedness robustly predicts cross-scale variation only under a meaning-based view of alternatives. Our results suggest that pragmatic inferences arise from context-driven expectations over alternatives, and these expectations operate at the level of concepts.

Submitted 7 April, 2023; originally announced April 2023.

Comments: To appear in TACL (pre-MIT Press publication version)

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2205.03472 [pdf, other]

When a sentence does not introduce a discourse entity, Transformer-based models still sometimes refer to it

Authors: Sebastian Schuster, Tal Linzen

Abstract: Understanding longer narratives or participating in conversations requires tracking of discourse entities that have been mentioned. Indefinite noun phrases (NPs), such as 'a dog', frequently introduce discourse entities but this behavior is modulated by sentential operators such as negation. For example, 'a dog' in 'Arthur doesn't own a dog' does not introduce a discourse entity due to the presence of negation. In this work, we adapt the psycholinguistic assessment of language models paradigm to higher-level linguistic phenomena and introduce an English evaluation suite that targets the knowledge of the interactions between sentential operators and indefinite NPs. We use this evaluation suite for a fine-grained investigation of the entity tracking abilities of the Transformer-based models GPT-2 and GPT-3. We find that while the models are to a certain extent sensitive to the interactions we investigate, they are all challenged by the presence of multiple NPs and their behavior is not systematic, which suggests that even models at the scale of GPT-3 do not fully acquire basic entity tracking abilities.

Submitted 6 May, 2022; originally announced May 2022.

Comments: To appear at NAACL 2022

arXiv:2203.13094 [pdf]

Six Insights into 6G: Orientation and Input for Developing Your Strategic 6G Research Plan

Authors: Kimberley Parsons Trommler, Matthias Hafner, Wolfgang Kellerer, Peter Merz, Sigurd Schuster, Josef Urban, Uwe Baeder, Bertram Gunzelmann, Andreas Kornbichler

Abstract: This paper is a summary of the findings from a series of workshops which were held by Thinknet 6G and MUENCHNER KREIS in 2021, with the goal to provide orientation and input for developing a strategic 6G research plan. The topics selected for the workshops are aspects of 6G that we expect will have a significant impact on other industries and on society:
- 6G as both a communication infrastructure and a sensing infrastructure
- The extensive use of artificial intelligence in 6G
- The security and resilience of 6G
This paper does not go into the technical details of how to develop and implement 6G. Rather, it provides input from experts from both the wireless industry as well as from other sectors about (mostly) non-technical topics that will need to be addressed in parallel with the technical developments, such as new use cases, regulation, communication with the public, and cross-industry cooperation. We have identified six areas that will have a significant impact on the development and use of 6G, and that organizations must consider as they begin their plans and designs for 6G. Based on these six impact areas and on the discussion in the workshops, we compiled a list of the top 10 recommendations for specific areas where organizations should place their focus when developing their strategic plan for 6G.
In addition, for our readers who are involved in 6G research, be it at a university, at a research institute or in industrial research, we also included a summary of the top 10 areas that require additional research, again based on the input received in the workshops. A version of this paper is also available at www.thinknet-6g.de. If you had a copy of the preview version of this paper, the text is exactly the same. Only the layout and graphics have changed.

Submitted 20 May, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: 20 pages, 1 figure

Report number: Thinknet-6G-2022-May-01 ACM Class: C.2.0; C.2.1; C.2.6

arXiv:2203.09397 [pdf, other]

Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive Bias to Sequence-to-sequence Models

Authors: Aaron Mueller, Robert Frank, Tal Linzen, Luheng Wang, Sebastian Schuster

Abstract: Relations between words are governed by hierarchical structure rather than linear ordering. Sequence-to-sequence (seq2seq) models, despite their success in downstream NLP applications, often fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations - for example, transforming declarative sentences into questions. However, syntactic evaluations of seq2seq models have only observed models that were not pre-trained on natural language data before being trained to perform syntactic transformations, in spite of the fact that pre-training has been found to induce hierarchical linguistic generalizations in language models; in other words, the syntactic capabilities of seq2seq models may have been greatly understated. We address this gap using the pre-trained seq2seq models T5 and BART, as well as their multilingual variants mT5 and mBART. We evaluate whether they generalize hierarchically on two transformations in two languages: question formation and passivization in English and German. We find that pre-trained seq2seq models generalize hierarchically when performing syntactic transformations, whereas models trained from scratch on syntactic transformations do not. This result presents evidence for the learnability of hierarchical syntactic information from non-annotated natural language text while also demonstrating that seq2seq models are capable of syntactic generalization, though only after exposure to much more language data than human learners receive.

Submitted 17 March, 2022; originally announced March 2022.

Comments: Accepted to Findings of ACL 2022

arXiv:2201.12266 [pdf]

Six Questions about 6G

Authors: Kimberley Parsons Trommler, Matthias Hafner, Wolfgang Kellerer, Peter Merz, Sigurd Schuster, Josef Urban, Uwe Baeder, Bertram Gunzelmann, Andreas Kornbichler

Abstract: Although 5G (Fifth Generation) mobile technology is still in the rollout phase, research and development of 6G (Sixth Generation) wireless have already begun. This paper is an introduction to 6G wireless networks, covering the main drivers for 6G, some of the expected use cases, some of the technical challenges in 6G, example areas that will require research and new technologies, the expected timeline for 6G development and rollout, and a list of some important 6G initiatives world-wide. It was compiled as part of a series of workshops about 6G held by Thinknet 6G and MUENCHNER KREIS in 2021.

Submitted 7 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: 6 pages, 3 figures, document also available in German, document available in a more attractive format, here: www.thinknet-6g.de

ACM Class: C.2.0; C.2.1; C.2.6

arXiv:2109.06987 [pdf, other]

NOPE: A Corpus of Naturally-Occurring Presuppositions in English

Authors: Alicia Parrish, Sebastian Schuster, Alex Warstadt, Omar Agha, Soo-Hwan Lee, Zhuoye Zhao, Samuel R. Bowman, Tal Linzen

Abstract: Understanding language requires grasping not only the overtly stated content, but also making inferences about things that were left unsaid. These inferences include presuppositions, a phenomenon by which a listener learns about new information through reasoning about what a speaker takes as given. Presuppositions require complex understanding of the lexical and syntactic properties that trigger them as well as the broader conversational context. In this work, we introduce the Naturally-Occurring Presuppositions in English (NOPE) Corpus to investigate the context-sensitivity of 10 different types of presupposition triggers and to evaluate machine learning models' ability to predict human inferences. We find that most of the triggers we investigate exhibit moderate variability. We further find that transformer-based models draw correct inferences in simple cases involving presuppositions, but they fail to capture the minority of exceptional cases in which human judgments reveal complex interactions between context and triggers.

Submitted 14 September, 2021; originally announced September 2021.

Comments: CoNLL 2021. Data and code available at https://github.com/nyu-mll/nope

arXiv:2004.10643 [pdf, other]

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Authors: Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

Submitted 22 April, 2020; originally announced April 2020.

Comments: LREC 2020

arXiv:1910.14254 [pdf, other]

Harnessing the linguistic signal to predict scalar inferences

Authors: Sebastian Schuster, Yuxing Chen, Judith Degen

Abstract: Pragmatic inferences often subtly depend on the presence or absence of linguistic features. For example, the presence of a partitive construction (of the) increases the strength of a so-called scalar inference: listeners perceive the inference that Chris did not eat all of the cookies to be stronger after hearing "Chris ate some of the cookies" than after hearing the same utterance without a partitive, "Chris ate some cookies." In this work, we explore to what extent neural network sentence encoders can learn to predict the strength of scalar inferences. We first show that an LSTM-based sentence encoder trained on an English dataset of human inference strength ratings is able to predict ratings with high accuracy (r=0.78). We then probe the model's behavior using manually constructed minimal sentence pairs and corpus data. We find that the model inferred previously established associations between linguistic features and inference strength, suggesting that the model learns to use linguistic features to predict pragmatic inferences.

Submitted 22 April, 2020; v1 submitted 31 October, 2019; originally announced October 2019.

Comments: ACL 2020; 16 pages, 8 figures

arXiv:1810.13327 [pdf, other]

Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog

Authors: Sebastian Schuster, Sonal Gupta, Rushin Shah, Mike Lewis

Abstract: One of the first steps in the utterance interpretation pipeline of many task-oriented conversational AI systems is to identify user intents and the corresponding slots. Since data collection for machine learning models for this task is time-consuming, it is desirable to make use of existing data in a high-resource language to train models in low-resource languages. However, development of such models has largely been hindered by the lack of multilingual training data. In this paper, we present a new data set of 57k annotated utterances in English (43k), Spanish (8.6k) and Thai (5k) across the domains weather, alarm, and reminder. We use this data set to evaluate three different cross-lingual transfer methods: (1) translating the training data, (2) using cross-lingual pre-trained embeddings, and (3) a novel method of using a multilingual machine translation encoder as contextual word representations. We find that given several hundred training examples in the target language, the latter two methods outperform translating the training data. Further, in very low-resource settings, multilingual contextual word representations give better results than using cross-lingual static embeddings. We also compare the cross-lingual methods to using monolingual resources in the form of contextual ELMo representations and find that given just small amounts of target language data, this method outperforms all cross-lingual methods, which highlights the need for more sophisticated cross-lingual methods.

Submitted 1 April, 2019; v1 submitted 31 October, 2018; originally announced October 2018.

Comments: 11 pages, to be presented at NAACL 2019

arXiv:1804.06922 [pdf, other]

Sentences with Gapping: Parsing and Reconstructing Elided Predicates

Authors: Sebastian Schuster, Joakim Nivre, Christopher D. Manning

Abstract: Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments. Surface syntax representations of such sentences are often produced poorly by parsers, and even if correct, not well suited to downstream natural language understanding tasks such as relation extraction that are typically designed to extract information from sentences with canonical clause structure. In this paper, we present two methods for parsing to a Universal Dependencies graph representation that explicitly encodes the elided material with additional nodes and edges. We find that both methods can reconstruct elided material from dependency trees with high accuracy when the parser correctly predicts the existence of a gap. We further demonstrate that one of our methods can be applied to other languages based on a case study on Swedish.

Submitted 18 April, 2018; originally announced April 2018.

Comments: To be presented at NAACL 2018

Journal ref: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018)

Showing 1–14 of 14 results for author: Schuster, S