Search | arXiv e-print repository

Modeling Ideological Salience and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity

Authors: Valentin Hofmann, Xiaowen Dong, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing u… ▽ More The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing upon insights from moral psychology. Our architecture combines graph neural networks with structured sparsity learning and results in representations for concepts and subreddits that capture temporal ideological dynamics such as right-wing and left-wing radicalization. △ Less

Submitted 14 December, 2022; v1 submitted 18 April, 2021; originally announced April 2021.

Comments: NAACL 2022 (Findings)

arXiv:2104.08551 [pdf, other]

Multi-source Neural Topic Modeling in Multi-view Embedding Spaces

Authors: Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze

Abstract: Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address data sparsity in short-text or small collection of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic-embeddings, and (2) pretrained word-embeddings (context… ▽ More Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address data sparsity in short-text or small collection of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic-embeddings, and (2) pretrained word-embeddings (context insensitive from Glove and context-sensitive from BERT models) jointly from one or many sources to improve topic quality and better deal with polysemy. In doing so, we first build respective pools of pretrained topic (i.e., TopicPool) and word embeddings (i.e., WordPool). We then identify one or more relevant source domain(s) and transfer knowledge to guide meaningful learning in the sparse target domain. Within neural topic modeling, we quantify the quality of topics and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. Introducing the multi-source multi-view embedding spaces, we have shown state-of-the-art neural topic modeling using 6 source (high-resource) and 5 target (low-resource) corpora. △ Less

Submitted 17 April, 2021; originally announced April 2021.

Comments: NAACL2021, 13 pages, 14 tables, 2 figures. arXiv admin note: substantial text overlap with arXiv:1909.06563

arXiv:2104.08401 [pdf, ps, other]

Enriching a Model's Notion of Belief using a Persistent Memory

Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schutze, Peter Clark

Abstract: Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are mo… ▽ More Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are more globally consistent and accurate in their answers. Our approach is to add a memory component -- a BeliefBank -- that records a model's answers, and two mechanisms that use it to improve consistency among beliefs. First, a reasoning component -- a weighted SAT solver -- improves consistency by flip** answers that significantly clash with others. Second, a feedback component re-queries the model but using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms improve both accuracy and consistency. This is significant as it is a first step towards endowing models with an evolving memory, allowing them to construct a more coherent picture of the world. △ Less

Submitted 7 October, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: This is an old and now obsolete draft. See arXiv:2109.14723 ("BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief") for the final paper

arXiv:2104.07540 [pdf, other]

Generating Datasets with Pretrained Language Models

Authors: Timo Schick, Hinrich Schütze

Abstract: To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obta… ▽ More To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets. △ Less

Submitted 4 October, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

Comments: Accepted at EMNLP2021

arXiv:2104.07094 [pdf, other]

Static Embeddings as Efficient Knowledge Bases?

Authors: Philipp Dufter, Nora Kassner, Hinrich Schütze

Abstract: Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically divers… ▽ More Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6% points better than BERT while just using 0.3% of energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary. △ Less

Submitted 14 April, 2021; originally announced April 2021.

Comments: NAACL2021 CRV; first two authors contributed equally

arXiv:2103.05131 [pdf, other]

Few-Shot Learning of an Interleaved Text Summarization Model by Pretraining with Synthetic Data

Authors: Sanjeev Kumar Karn, Francine Chen, Yan-Ying Chen, Ulli Waltinger, Hinrich Schuetze

Abstract: Interleaved texts, where posts belonging to different threads occur in a sequence, commonly occur in online chat posts, so that it can be time-consuming to quickly obtain an overview of the discussions. Existing systems first disentangle the posts by threads and then extract summaries from those threads. A major issue with such systems is error propagation from the disentanglement component. While… ▽ More Interleaved texts, where posts belonging to different threads occur in a sequence, commonly occur in online chat posts, so that it can be time-consuming to quickly obtain an overview of the discussions. Existing systems first disentangle the posts by threads and then extract summaries from those threads. A major issue with such systems is error propagation from the disentanglement component. While end-to-end trainable summarization system could obviate explicit disentanglement, such systems require a large amount of labeled data. To address this, we propose to pretrain an end-to-end trainable hierarchical encoder-decoder system using synthetic interleaved texts. We show that by fine-tuning on a real-world meeting dataset (AMI), such a system out-performs a traditional two-step system by 22%. We also compare against transformer models and observed that pretraining with synthetic data both the encoder and decoder outperforms the BertSumExtAbs transformer model which pretrains only the encoder on a large dataset. △ Less

Submitted 8 March, 2021; originally announced March 2021.

Comments: Adapt-NLP: The Second Workshop on Domain Adaptation for NLP

arXiv:2103.00453 [pdf, other]

Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

Authors: Timo Schick, Sahana Udupa, Hinrich Schütze

Abstract: When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In t… ▽ More When trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction. △ Less

Submitted 9 September, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

Comments: Accepted at TACL

arXiv:2102.11090 [pdf, other]

Position Information in Transformers: An Overview

Authors: Philipp Dufter, Martin Schmitt, Hinrich Schütze

Abstract: Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate positio… ▽ More Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase that position information in Transformer is a vibrant and extensive research area; (2) enable the reader to compare existing methods by providing a unified notation and systematization of different approaches along important model dimensions; (3) indicate what characteristics of an application should be taken into account when selecting a position encoding; (4) provide stimuli for future research. △ Less

Submitted 9 September, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: First two authors contributed equally

arXiv:2102.05331 [pdf, other]

Language Models for Lexical Inference in Context

Authors: Martin Schmitt, Hinrich Schütze

Abstract: Lexical inference in context (LIiC) is the task of recognizing textual entailment between two very similar sentences, i.e., sentences that only differ in one expression. It can therefore be seen as a variant of the natural language inference task that is focused on lexical semantics. We formulate and evaluate the first approaches based on pretrained language models (LMs) for this task: (i) a few-s… ▽ More Lexical inference in context (LIiC) is the task of recognizing textual entailment between two very similar sentences, i.e., sentences that only differ in one expression. It can therefore be seen as a variant of the natural language inference task that is focused on lexical semantics. We formulate and evaluate the first approaches based on pretrained language models (LMs) for this task: (i) a few-shot NLI classifier, (ii) a relation induction approach based on handcrafted patterns expressing the semantics of lexical inference, and (iii) a variant of (ii) with patterns that were automatically extracted from a corpus. All our approaches outperform the previous state of the art, showing the potential of pretrained LMs for LIiC. In an extensive analysis, we investigate factors of success and failure of our three approaches. △ Less

Submitted 27 April, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: Final version of EACL 2021 long paper

arXiv:2102.04760 [pdf, other]

Improving Scene Graph Classification by Exploiting Knowledge from Texts

Authors: Sahand Sharifzadeh, Sina Moayed Baharlou, Martin Schmitt, Hinrich Schütze, Volker Tresp

Abstract: Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene description… ▽ More Training scene graph classification models requires a large amount of annotated image data. Meanwhile, scene graphs represent relational knowledge that can be modeled with symbolic data from texts or knowledge graphs. While image annotation demands extensive labor, collecting textual descriptions of natural scenes requires less effort. In this work, we investigate whether textual scene descriptions can substitute for annotated image data. To this end, we employ a scene graph classification framework that is trained not only from annotated images but also from symbolic data. In our architecture, the symbolic entities are first mapped to their correspondent image-grounded representations and then fed into the relational reasoning pipeline. Even though a structured form of knowledge, such as the form in knowledge graphs, is not always available, we can generate it from unstructured texts using a transformer-based language model. We show that by fine-tuning the classification pipeline with the extracted knowledge from texts, we can achieve ~8x more accurate results in scene graph classification, ~3x in object classification, and ~1.5x in predicate classification, compared to the supervised baselines with only 1% of the annotated images. △ Less

Submitted 8 October, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

arXiv:2102.03596 [pdf, other]

Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models

Authors: Lutfi Kerem Senel, Hinrich Schütze

Abstract: Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with `fill in the blank' style cloze q… ▽ More Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with `fill in the blank' style cloze questions. Existing probing datasets mainly focus on knowledge about relations between words and entities. We introduce WDLMPro (Word Definition Language Model Probing) to evaluate word understanding directly using dictionary definitions of words. In our experiments, three popular pretrained language models struggle to match words and their definitions. This indicates that they understand many words poorly and that our new probing task is a difficult challenge that could help guide research on LMs in the future. △ Less

Submitted 6 February, 2021; originally announced February 2021.

Comments: 5 pages, to appear in EACL 2021

arXiv:2102.01017 [pdf, other]

Measuring and Improving Consistency in Pretrained Language Models

Authors: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, Yoav Goldberg

Abstract: Consistency of a model -- that is, the invariance of its behavior under meaning-preserving alternations in its input -- is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases… ▽ More Consistency of a model -- that is, the invariance of its behavior under meaning-preserving alternations in its input -- is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel, we show that the consistency of all PLMs we experiment with is poor -- though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness. △ Less

Submitted 29 May, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Comments: Accepted to the TACL journal, pre-MIT Press publication version

arXiv:2102.00894 [pdf, other]

Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models

Authors: Nora Kassner, Philipp Dufter, Hinrich Schütze

Abstract: Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowle… ▽ More Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT's performance as knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin. △ Less

Submitted 1 February, 2021; originally announced February 2021.

Comments: Accepted to EACL 2021

arXiv:2101.00403 [pdf, other]

Superbizarre Is Not Superb: Derivational Morphology Improves BERT's Interpretation of Complex Words

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else… ▽ More How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used. △ Less

Submitted 2 June, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

Comments: ACL 2021

arXiv:2012.15682 [pdf, other]

A Closer Look at Few-Shot Crosslingual Transfer: The Choice of Shots Matters

Authors: Mengjie Zhao, Yi Zhu, Ehsan Shareghi, Ivan Vulić, Roi Reichart, Anna Korhonen, Hinrich Schütze

Abstract: Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT. Despite its growing popularity, little to no attention has been paid to standardizing and analyzing the design of few-shot experiments. In this work, we highlight a fundamental risk posed by this shortcoming, illustrating that the model exhibits a high degree of se… ▽ More Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT. Despite its growing popularity, little to no attention has been paid to standardizing and analyzing the design of few-shot experiments. In this work, we highlight a fundamental risk posed by this shortcoming, illustrating that the model exhibits a high degree of sensitivity to the selection of few shots. We conduct a large-scale experimental study on 40 sets of sampled few shots for six diverse NLP tasks across up to 40 languages. We provide an analysis of success and failure cases of few-shot transfer, which highlights the role of lexical features. Additionally, we show that a straightforward full model finetuning approach is quite effective for few-shot transfer, outperforming several state-of-the-art few-shot approaches. As a step towards standardizing few-shot crosslingual experimental designs, we make our sampled few shots publicly available. △ Less

Submitted 2 June, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: ACL-IJCNLP 2021

arXiv:2012.11926 [pdf, other]

Few-Shot Text Generation with Pattern-Exploiting Training

Authors: Timo Schick, Hinrich Schütze

Abstract: Providing pretrained language models with simple task descriptions in natural language enables them to solve some tasks in a fully unsupervised fashion. Moreover, when combined with regular learning from examples, this idea yields impressive few-shot results for a wide range of text classification tasks. It is also a promising direction to improve data efficiency in generative settings, but there… ▽ More Providing pretrained language models with simple task descriptions in natural language enables them to solve some tasks in a fully unsupervised fashion. Moreover, when combined with regular learning from examples, this idea yields impressive few-shot results for a wide range of text classification tasks. It is also a promising direction to improve data efficiency in generative settings, but there are several challenges to using a combination of task descriptions and example-based learning for text generation. In particular, it is crucial to find task descriptions that are easy to understand for the pretrained model and to ensure that it actually makes good use of them; furthermore, effective measures against overfitting have to be implemented. In this paper, we show how these challenges can be tackled: We introduce GenPET, a method for text generation that is based on pattern-exploiting training, a recent approach for combining textual instructions with supervised learning that only works for classification tasks. On several summarization and headline generation datasets, GenPET gives consistent improvements over strong baselines in few-shot settings. △ Less

Submitted 4 October, 2021; v1 submitted 22 December, 2020; originally announced December 2020.

Comments: Accepted at EMNLP2021

arXiv:2012.11657 [pdf, other]

Subword Sampling for Low Resource Word Alignment

Authors: Ehsaneddin Asgari, Masoud Jalili Sabet, Philipp Dufter, Christopher Ringlstetter, Hinrich Schütze

Abstract: Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most of the existing word alignment methods are designed for a high resource setting in machine translation where millions of parallel sentences are available. This amount reduces to a few thousands of sen… ▽ More Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most of the existing word alignment methods are designed for a high resource setting in machine translation where millions of parallel sentences are available. This amount reduces to a few thousands of sentences when dealing with low-resource languages failing the existing established IBM models. In this paper, we propose subword sampling-based alignment of text units. This method's hypothesis is that the aggregation of different granularities of text for certain language pairs can help word-level alignment. For certain languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework to optimize selecting possible subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages at no supervision and consistently improve the alignment results. We observe that using $5K$ parallel sentences together with our proposed subword sampling approach, we obtain similar F1 scores to the use of $100K$'s of parallel sentences in existing word-level fast-align/eflomal alignment methods. △ Less

Submitted 15 June, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

arXiv:2010.16407 [pdf, other]

TopicBERT for Energy Efficient Document Classification

Authors: Yatin Chaudhary, Pankaj Gupta, Khushbu Saxena, Vivek Kulkarni, Thomas Runkler, Hinrich Schütze

Abstract: Prior research notes that BERT's computational cost grows quadratically with sequence length thus leading to longer training times, higher GPU memory constraints and carbon emissions. While recent work seeks to address these scalability issues at pre-training, these issues are also prominent in fine-tuning especially for long sequence tasks like document classification. Our work thus focuses on op… ▽ More Prior research notes that BERT's computational cost grows quadratically with sequence length thus leading to longer training times, higher GPU memory constraints and carbon emissions. While recent work seeks to address these scalability issues at pre-training, these issues are also prominent in fine-tuning especially for long sequence tasks like document classification. Our work thus focuses on optimizing the computational cost of fine-tuning for document classification. We achieve this by complementary learning of both topic and language models in a unified framework, named TopicBERT. This significantly reduces the number of self-attention operations - a main performance bottleneck. Consequently, our model achieves a 1.4x ($\sim40\%$) speedup with $\sim40\%$ reduction in $CO_2$ emission while retaining $99.9\%$ performance over 5 datasets. △ Less

Submitted 14 October, 2020; originally announced October 2020.

Comments: EMNLP2020 (Findings): 9 pages, 5 figures, 8 Tables

arXiv:2010.13641 [pdf, other]

Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification

Authors: Timo Schick, Helmut Schmid, Hinrich Schütze

Abstract: A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this map** between words and labels requires both domain expertise and an understanding of the language model's abilities. To mitigate this issue, we de… ▽ More A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this map** between words and labels requires both domain expertise and an understanding of the language model's abilities. To mitigate this issue, we devise an approach that automatically finds such a map** given small amounts of training data. For a number of tasks, the map** found by our approach performs almost as well as hand-crafted label-to-word map**s. △ Less

Submitted 26 October, 2020; originally announced October 2020.

Comments: To appear at COLING 2020

arXiv:2010.12684 [pdf, other]

Dynamic Contextualized Word Embeddings

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language… ▽ More Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets. △ Less

Submitted 8 June, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: ACL 2021

arXiv:2009.13375 [pdf, other]

Identifying Automatically Generated Headlines using Transformers

Authors: Antonis Maronikolakis, Hinrich Schutze, Mark Stevenson

Abstract: False information spread via the internet and social media influences public opinion and user activity, while generative models enable fake content to be generated faster and more cheaply than had previously been possible. In the not so distant future, identifying fake content generated by deep learning models will play a key role in protecting users from misinformation. To this end, a dataset con… ▽ More False information spread via the internet and social media influences public opinion and user activity, while generative models enable fake content to be generated faster and more cheaply than had previously been possible. In the not so distant future, identifying fake content generated by deep learning models will play a key role in protecting users from misinformation. To this end, a dataset containing human and computer-generated headlines was created and a user study indicated that humans were only able to identify the fake headlines in 47.8% of the cases. However, the most accurate automatic approach, transformers, achieved an overall accuracy of 85.7%, indicating that content generated from language models can be filtered out accurately. △ Less

Submitted 25 April, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

Comments: NLP4IF 2021 Proceedings, NAACL 2021

arXiv:2009.07118 [pdf, other]

It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

Authors: Timo Schick, Hinrich Schütze

Abstract: When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 ca… ▽ More When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much "greener" in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models. △ Less

Submitted 12 April, 2021; v1 submitted 15 September, 2020; originally announced September 2020.

Comments: Accepted at NAACL2021

arXiv:2007.08426 [pdf, other]

Investigating Pretrained Language Models for Graph-to-Text Generation

Authors: Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, Iryna Gurevych

Abstract: Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recently proposed pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs.… ▽ More Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recently proposed pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. We show that the PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further. In particular, we report new state-of-the-art BLEU scores of 49.72 on LDC2017T10, 59.70 on WebNLG, and 25.66 on AGENDA datasets - a relative improvement of 31.8%, 4.5%, and 42.4%, respectively. In an extensive analysis, we identify possible reasons for the PLMs' success on graph-to-text tasks. We find evidence that their knowledge about true facts helps them perform well even when the input graph representation is reduced to a simple bag of node and edge labels. △ Less

Submitted 27 September, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted as a long paper to NLP4ConvAI, EMNLP2021

arXiv:2006.14209 [pdf, ps, other]

doi 10.18653/v1/P19-1034

Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes

Authors: Marina Sedinkina, Nikolas Breitkopf, Hinrich Schütze

Abstract: In this paper, we automatically create sentiment dictionaries for predicting financial outcomes. We compare three approaches: (I) manual adaptation of the domain-general dictionary H4N, (ii) automatic adaptation of H4N and (iii) a combination consisting of first manual, then automatic adaptation. In our experiments, we demonstrate that the automatically adapted sentiment dictionary outperforms the… ▽ More In this paper, we automatically create sentiment dictionaries for predicting financial outcomes. We compare three approaches: (I) manual adaptation of the domain-general dictionary H4N, (ii) automatic adaptation of H4N and (iii) a combination consisting of first manual, then automatic adaptation. In our experiments, we demonstrate that the automatically adapted sentiment dictionary outperforms the previous state of the art in predicting the financial outcomes excess return and volatility. In particular, automatic adaptation performs better than manual adaptation. In our analysis, we find that annotation based on an expert's a priori belief about a word's meaning can be incorrect -- annotation should be performed based on the word's contexts in the target domain instead. △ Less

Submitted 25 June, 2020; originally announced June 2020.

Comments: Accepted at ACL2019

arXiv:2006.10909 [pdf, other]

Neural Topic Modeling with Continual Lifelong Learning

Authors: Pankaj Gupta, Yatin Chaudhary, Thomas Runkler, Hinrich Schütze

Abstract: Lifelong learning has recently attracted attention in building machine learning systems that continually accumulate and transfer knowledge to help future learning. Unsupervised topic modeling has been popularly used to discover topics from document collections. However, the application of topic modeling is challenging due to data sparsity, e.g., in a small collection of (short) documents and thus,… ▽ More Lifelong learning has recently attracted attention in building machine learning systems that continually accumulate and transfer knowledge to help future learning. Unsupervised topic modeling has been popularly used to discover topics from document collections. However, the application of topic modeling is challenging due to data sparsity, e.g., in a small collection of (short) documents and thus, generate incoherent topics and sub-optimal document representations. To address the problem, we propose a lifelong learning framework for neural topic modeling that can continuously process streams of document collections, accumulate topics and guide future topic modeling tasks by knowledge transfer from several sources to better deal with the sparse data. In the lifelong process, we particularly investigate jointly: (1) sharing generative homologies (latent topics) over lifetime to transfer prior knowledge, and (2) minimizing catastrophic forgetting to retain the past learning via novel selective data augmentation, co-training and topic regularization approaches. Given a stream of document collections, we apply the proposed Lifelong Neural Topic Modeling (LNTM) framework in modeling three sparse document collections as future tasks and demonstrate improved performance quantified by perplexity, topic coherence and information retrieval task. △ Less

Submitted 27 June, 2023; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted at ICML2020 (13 pages, 11 figures, 9 tables)

arXiv:2006.10632 [pdf, other]

Explainable and Discourse Topic-aware Neural Language Understanding

Authors: Yatin Chaudhary, Hinrich Schütze, Pankaj Gupta

Abstract: Marrying topic models and language models exposes language understanding to a broader source of document-level context beyond sentences via topics. While introducing topical semantics in language models, existing approaches incorporate latent document topic proportions and ignore topical discourse in sentences of the document. This work extends the line of research by additionally introducing an e… ▽ More Marrying topic models and language models exposes language understanding to a broader source of document-level context beyond sentences via topics. While introducing topical semantics in language models, existing approaches incorporate latent document topic proportions and ignore topical discourse in sentences of the document. This work extends the line of research by additionally introducing an explainable topic representation in language understanding, obtained from a set of key terms correspondingly for each latent topic of the proportion. Moreover, we retain sentence-topic associations along with document-topic association by modeling topical discourse for every sentence in the document. We present a novel neural composite language model that exploits both the latent and explainable topics along with topical discourse at sentence-level in a joint learning framework of topic and language models. Experiments over a range of tasks such as language modeling, word sense disambiguation, document classification, retrieval and text generation demonstrate ability of the proposed model in improving language understanding. △ Less

Submitted 27 June, 2023; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted at ICML2020 (13 pages, 2 figures)

arXiv:2006.10413 [pdf, other]

Are Pretrained Language Models Symbolic Reasoners Over Knowledge?

Authors: Nora Kassner, Benno Krojer, Hinrich Schütze

Abstract: How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reas… ▽ More How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success. △ Less

Submitted 10 October, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

Comments: Accepted to CoNLL 2020

arXiv:2006.09242 [pdf, other]

Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs

Authors: Martin Schmitt, Leonardo F. R. Ribeiro, Philipp Dufter, Iryna Gurevych, Hinrich Schütze

Abstract: We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation. With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns. We represent the relation between two nodes as the length of the shortest path between them. Graformer learns to weig… ▽ More We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation. With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns. We represent the relation between two nodes as the length of the shortest path between them. Graformer learns to weight these node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph. We evaluate Graformer on two popular graph-to-text generation benchmarks, AGENDA and WebNLG, where it achieves strong performance while using many fewer parameters than other approaches. △ Less

Submitted 27 April, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: Accepted as a long paper at TextGraphs 2021

arXiv:2005.07979 [pdf, other]

Unsupervised Embedding-based Detection of Lexical Semantic Changes

Authors: Ehsaneddin Asgari, Christoph Ringlstetter, Hinrich Schütze

Abstract: This paper describes EmbLexChange, a system introduced by the "Life-Language" team for SemEval-2020 Task 1, on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding based profiles of word w (calculated with respect to a set of reference words) in the source and the target domains (source and target domains can be simply two time frames… ▽ More This paper describes EmbLexChange, a system introduced by the "Life-Language" team for SemEval-2020 Task 1, on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding based profiles of word w (calculated with respect to a set of reference words) in the source and the target domains (source and target domains can be simply two time frames t1 and t2). The underlying assumption is that the lexical-semantic change of word w would affect its co-occurring words and subsequently alters the neighborhoods in the embedding spaces. We show that using a resampling framework for the selection of reference words, we can reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in the SemEval-2020. △ Less

Submitted 16 May, 2020; originally announced May 2020.

arXiv:2005.00766 [pdf, other]

BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA

Authors: Nora Kassner, Hinrich Schütze

Abstract: Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collecti… ▽ More Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collection. Our contributions are as follows: i) BERT-kNN outperforms BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., US city), but only kNN recovers the factually correct answer (e.g., "Miami"). iii) Compared to BERT, BERT-kNN excels for rare facts. iv) BERT-kNN can easily handle facts not covered by BERT's training set, e.g., recent events. △ Less

Submitted 12 October, 2020; v1 submitted 2 May, 2020; originally announced May 2020.

Comments: to appear in EMNLP Findings

arXiv:2005.00672 [pdf, other]

DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

Abstract: Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms th… ▽ More Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used. △ Less

Submitted 7 October, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

arXiv:2005.00396 [pdf, other]

Identifying Necessary Elements for BERT's Multilinguality

Authors: Philipp Dufter, Hinrich Schütze

Abstract: It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT a… ▽ More It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation we propose an efficient setup with small BERT models trained on a mix of synthetic and natural data. Overall, we identify four architectural and two linguistic elements that influence multilinguality. Based on our insights, we experiment with a multilingual pretraining setup that modifies the masking strategy using VecMap, i.e., unsupervised embedding alignment. Experiments on XNLI with three languages indicate that our findings transfer from our small setup to larger scale settings. △ Less

Submitted 8 February, 2021; v1 submitted 1 May, 2020; originally announced May 2020.

Comments: EMNLP2020 CRV

arXiv:2004.12406 [pdf, other]

Masking as an Efficient Alternative to Finetuning for Pretrained Language Models

Authors: Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, Hinrich Schütze

Abstract: We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be in… ▽ More We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT and RoBERTa on a series of NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred simultaneously. Through intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning. △ Less

Submitted 11 October, 2020; v1 submitted 26 April, 2020; originally announced April 2020.

Comments: EMNLP 2020; MZ and TL contribute equally

arXiv:2004.12198 [pdf, other]

Quantifying the Contextualization of Word Representations with Semantic Class Probing

Authors: Mengjie Zhao, Philipp Dufter, Yadollah Yaghoobzadeh, Hinrich Schütze

Abstract: Pretrained language models have achieved a new state of the art on many NLP tasks, but there are still many open questions about how and why they work so well. We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its context… ▽ More Pretrained language models have achieved a new state of the art on many NLP tasks, but there are still many open questions about how and why they work so well. We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embeddings. Quantifying contextualization helps in understanding and utilizing pretrained language models. We show that top layer representations achieve high accuracy inferring semantic classes; that the strongest contextualization effects occur in the lower layers; that local context is mostly sufficient for semantic class inference; and that top layer representations are more task-specific after finetuning while lower layer representations are more transferable. Finetuning uncovers task related features, but pretrained knowledge is still largely preserved. △ Less

Submitted 11 October, 2020; v1 submitted 25 April, 2020; originally announced April 2020.

Comments: EMNLP Findings 2020

arXiv:2004.08728 [pdf, other]

SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Authors: Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze

Abstract: Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that re… ▽ More Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners, even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences. △ Less

Submitted 16 April, 2021; v1 submitted 18 April, 2020; originally announced April 2020.

Comments: EMNLP (Findings) 2020

arXiv:2004.03354 [pdf, other]

Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We eva… ▽ More Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO_2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT-BERT F1 delta, at 5% of BioBERT's CO_2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic. △ Less

Submitted 27 June, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

arXiv:2001.07676 [pdf, other]

Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference

Authors: Timo Schick, Hinrich Schütze

Abstract: Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with "task descriptions" in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that the two ideas can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates i… ▽ More Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with "task descriptions" in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that the two ideas can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. For several tasks and languages, PET outperforms supervised training and strong semi-supervised approaches in low-resource settings by a large margin. △ Less

Submitted 25 January, 2021; v1 submitted 21 January, 2020; originally announced January 2020.

Comments: Accepted at EACL2021

arXiv:2001.02284 [pdf, other]

Multipurpose Intelligent Process Automation via Conversational Assistant

Authors: Alena Moiseeva, Dietrich Trautmann, Michael Heimann, Hinrich Schütze

Abstract: Intelligent Process Automation (IPA) is an emerging technology with a primary goal to assist the knowledge worker by taking care of repetitive, routine and low-cognitive tasks. Conversational agents that can interact with users in a natural language are potential application for IPA systems. Such intelligent agents can assist the user by answering specific questions and executing routine tasks tha… ▽ More Intelligent Process Automation (IPA) is an emerging technology with a primary goal to assist the knowledge worker by taking care of repetitive, routine and low-cognitive tasks. Conversational agents that can interact with users in a natural language are potential application for IPA systems. Such intelligent agents can assist the user by answering specific questions and executing routine tasks that are ordinarily performed in a natural language (i.e., customer support). In this work, we tackle a challenge of implementing an IPA conversational assistant in a real-world industrial setting with a lack of structured training data. Our proposed system brings two significant benefits: First, it reduces repetitive and time-consuming activities and, therefore, allows workers to focus on more intelligent processes. Second, by interacting with users, it augments the resources with structured and to some extent labeled training data. We showcase the usage of the latter by re-implementing several components of our system with Transfer Learning (TL) methods. △ Less

Submitted 21 May, 2020; v1 submitted 7 January, 2020; originally announced January 2020.

Comments: Presented at the AAAI-20 Workshop on Intelligent Process Automation

arXiv:1912.05877 [pdf, other]

Extending Machine Language Models toward Human-Level Language Understanding

Authors: James L. McClelland, Felix Hill, Maja Rudolph, Jason Baldridge, Hinrich Schütze

Abstract: Language is crucial for human intelligence, but what exactly is its role? We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerges gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and… ▽ More Language is crucial for human intelligence, but what exactly is its role? We take language to be a part of a system for understanding and communicating about situations. The human ability to understand and communicate about situations emerges gradually from experience and depends on domain-general principles of biological neural networks: connection-based learning, distributed representation, and context-sensitive, mutual constraint satisfaction-based processing. Current artificial language processing systems rely on the same domain general principles, embodied in artificial neural networks. Indeed, recent progress in this field depends on \emph{query-based attention}, which extends the ability of these systems to exploit context and has contributed to remarkable breakthroughs. Nevertheless, most current models focus exclusively on language-internal tasks, limiting their ability to perform tasks that depend on understanding situations. These systems also lack memory for the contents of prior situations outside of a fixed contextual span. We describe the organization of the brain's distributed understanding system, which includes a fast learning system that addresses the memory problem. We sketch a framework for future models of understanding drawing equally on cognitive neuroscience and artificial intelligence and exploiting query-based attention. We highlight relevant current directions and consider further developments needed to fully capture human-level language understanding in a computational system. △ Less

Submitted 4 July, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

arXiv:1911.04916 [pdf, ps, other]

Morphological Segmentation Inside-Out

Authors: Ryan Cotterell, Arun Kumar, Hinrich Schütze

Abstract: Morphological segmentation has traditionally been modeled with non-hierarchical models, which yield flat segmentations as output. In many cases, however, proper morphological analysis requires hierarchical structure -- especially in the case of derivational morphology. In this work, we introduce a discriminative, joint model of morphological segmentation along with the orthographic changes that oc… ▽ More Morphological segmentation has traditionally been modeled with non-hierarchical models, which yield flat segmentations as output. In many cases, however, proper morphological analysis requires hierarchical structure -- especially in the case of derivational morphology. In this work, we introduce a discriminative, joint model of morphological segmentation along with the orthographic changes that occur during word formation. To the best of our knowledge, this is the first attempt to approach discriminative segmentation with a context-free model. Additionally, we release an annotated treebank of 7454 English words with constituency parses, encouraging future research in this area. △ Less

Submitted 12 February, 2021; v1 submitted 12 November, 2019; originally announced November 2019.

Comments: EMNLP 2016

arXiv:1911.03700 [pdf, other]

Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015)… ▽ More We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised State of The Art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems. △ Less

Submitted 24 June, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

arXiv:1911.03681 [pdf, other]

E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT

Authors: Nina Poerner, Ulli Waltinger, Hinrich Schütze

Abstract: We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT's native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al.… ▽ More We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT's native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019), but it requires no expensive further pretraining of the BERT encoder. We evaluate E-BERT on unsupervised question answering (QA), supervised relation classification (RC) and entity linking (EL). On all three tasks, E-BERT outperforms BERT and other baselines. We also show quantitatively that the original BERT model is overly reliant on the surface form of entity names (e.g., guessing that someone with an Italian-sounding name speaks Italian), and that E-BERT mitigates this problem. △ Less

Submitted 1 May, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

arXiv:1911.03343 [pdf, other]

Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly

Authors: Nora Kassner, Hinrich Schütze

Abstract: Building on Petroni et al. (2019), we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated ("Birds cannot [MASK]") and non-negated ("Birds can [MASK]") cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add "misprimes" to cloze questions ("Talk? Bir… ▽ More Building on Petroni et al. (2019), we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated ("Birds cannot [MASK]") and non-negated ("Birds can [MASK]") cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add "misprimes" to cloze questions ("Talk? Birds can [MASK]"). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge. △ Less

Submitted 15 May, 2020; v1 submitted 8 November, 2019; originally announced November 2019.

Comments: ACL 2020

arXiv:1910.07181 [pdf, other]

BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

Authors: Timo Schick, Hinrich Schütze

Abstract: Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a p… ▽ More Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks. △ Less

Submitted 29 April, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

Comments: Accepted at ACL2020

arXiv:1910.03385 [pdf, other]

Linguistically Informed Relation Extraction and Neural Architectures for Nested Named Entity Recognition in BioNLP-OST 2019

Authors: Usama Yaseen, Pankaj Gupta, Hinrich Schütze

Abstract: Named Entity Recognition (NER) and Relation Extraction (RE) are essential tools in distilling knowledge from biomedical literature. This paper presents our findings from participating in BioNLP Shared Tasks 2019. We addressed Named Entity Recognition including nested entities extraction, Entity Normalization and Relation Extraction. Our proposed approach of Named Entities can be generalized to dif… ▽ More Named Entity Recognition (NER) and Relation Extraction (RE) are essential tools in distilling knowledge from biomedical literature. This paper presents our findings from participating in BioNLP Shared Tasks 2019. We addressed Named Entity Recognition including nested entities extraction, Entity Normalization and Relation Extraction. Our proposed approach of Named Entities can be generalized to different languages and we have shown it's effectiveness for English and Spanish text. We investigated linguistic features, hybrid loss including ranking and Conditional Random Fields (CRF), multi-task objective and token-level ensembling strategy to improve NER. We employed dictionary based fuzzy and semantic search to perform Entity Normalization. Finally, our RE system employed Support Vector Machine (SVM) with linguistic features. Our NER submission (team:MIC-CIS) ranked first in BB-2019 norm+NER task with standard error rate (SER) of 0.7159 and showed competitive performance on PharmaCo NER task with F1-score of 0.8662. Our RE system ranked first in the SeeDev-binary Relation Extraction Task with F1-score of 0.3738. △ Less

Submitted 8 October, 2019; originally announced October 2019.

Comments: EMNLP 2019, 11 pages, 4 figures, 8 tables

arXiv:1910.00546 [pdf, other]

doi 10.1613/jair.1.11725

Type-aware Convolutional Neural Networks for Slot Filling

Authors: Heike Adel, Hinrich Schütze

Abstract: The slot filling task aims at extracting answers for queries about entities from text, such as "Who founded Apple". In this paper, we focus on the relation classification component of a slot filling system. We propose type-aware convolutional neural networks to benefit from the mutual dependencies between entity and relation classification. In particular, we explore different ways of integrating t… ▽ More The slot filling task aims at extracting answers for queries about entities from text, such as "Who founded Apple". In this paper, we focus on the relation classification component of a slot filling system. We propose type-aware convolutional neural networks to benefit from the mutual dependencies between entity and relation classification. In particular, we explore different ways of integrating the named entity types of the relation arguments into a neural network for relation classification, including a joint training and a structured prediction approach. To the best of our knowledge, this is the first study on type-aware neural networks for slot filling. The type-aware models lead to the best results of our slot filling pipeline. Joint training performs comparable to structured prediction. To understand the impact of the different components of the slot filling pipeline, we perform a recall analysis, a manual error analysis and several ablation studies. Such analyses are of particular importance to other slot filling researchers since the official slot filling evaluations only assess pipeline outputs. The analyses show that especially coreference resolution and our convolutional neural networks have a large positive impact on the final performance of the slot filling pipeline. The presented models, the source code of our system as well as our coreference resource is publicy available. △ Less

Submitted 1 October, 2019; originally announced October 2019.

Comments: Journal of Artificial Intelligence Research (JAIR), volume 66

arXiv:1910.00314 [pdf, other]

BioNLP-OST 2019 RDoC Tasks: Multi-grain Neural Relevance Ranking Using Topics and Attention Based Query-Document-Sentence Interactions

Authors: Yatin Chaudhary, Pankaj Gupta, Hinrich Schütze

Abstract: This paper presents our system details and results of participation in the RDoC Tasks of BioNLP-OST 2019. Research Domain Criteria (RDoC) construct is a multi-dimensional and broad framework to describe mental health disorders by combining knowledge from genomics to behaviour. Non-availability of RDoC labelled dataset and tedious labelling process hinders the use of RDoC framework to reach its ful… ▽ More This paper presents our system details and results of participation in the RDoC Tasks of BioNLP-OST 2019. Research Domain Criteria (RDoC) construct is a multi-dimensional and broad framework to describe mental health disorders by combining knowledge from genomics to behaviour. Non-availability of RDoC labelled dataset and tedious labelling process hinders the use of RDoC framework to reach its full potential in Biomedical research community and Healthcare industry. Therefore, Task-1 aims at retrieval and ranking of PubMed abstracts relevant to a given RDoC construct and Task-2 aims at extraction of the most relevant sentence from a given PubMed abstract. We investigate (1) attention based supervised neural topic model and SVM for retrieval and ranking of PubMed abstracts and, further utilize BM25 and other relevance measures for re-ranking, (2) supervised and unsupervised sentence ranking models utilizing multi-view representations comprising of query-aware attention-based sentence representation (QAR), bag-of-words (BoW) and TF-IDF. Our best systems achieved 1st rank and scored 0.86 mean average precision (mAP) and 0.58 macro average accuracy (MAA) in Task-1 and Task-2 respectively. △ Less

Submitted 2 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

Comments: EMNLP2019, 10 pages, 2 figures, 7 tables

arXiv:1909.06563 [pdf, other]

Multi-view and Multi-source Transfers in Neural Topic Modeling with Pretrained Topic and Word Embeddings

Authors: Pankaj Gupta, Yatin Chaudhary, Hinrich Schütze

Abstract: Though word embeddings and topics are complementary representations, several past works have only used pre-trained word embeddings in (neural) topic modeling to address data sparsity problem in short text or small collection of documents. However, no prior work has employed (pre-trained latent) topics in transfer learning paradigm. In this paper, we propose an approach to (1) perform knowledge tra… ▽ More Though word embeddings and topics are complementary representations, several past works have only used pre-trained word embeddings in (neural) topic modeling to address data sparsity problem in short text or small collection of documents. However, no prior work has employed (pre-trained latent) topics in transfer learning paradigm. In this paper, we propose an approach to (1) perform knowledge transfer using latent topics obtained from a large source corpus, and (2) jointly transfer knowledge via the two representations (or views) in neural topic modeling to improve topic quality, better deal with polysemy and data sparsity issues in a target corpus. In doing so, we first accumulate topics and word representations from one or many source corpora to build a pool of topics and word vectors. Then, we identify one or multiple relevant source domain(s) and take advantage of corresponding topics and word features via the respective pools to guide meaningful learning in the sparse target domain. We quantify the quality of topic and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. We have demonstrated the state-of-the-art results on topic modeling with the proposed framework. △ Less

Submitted 17 September, 2019; v1 submitted 14 September, 2019; originally announced September 2019.

arXiv:1909.06162 [pdf, other]

Neural Architectures for Fine-Grained Propaganda Detection in News

Authors: Pankaj Gupta, Khushbu Saxena, Usama Yaseen, Thomas Runkler, Hinrich Schütze

Abstract: This paper describes our system (MIC-CIS) details and results of participation in the fine-grained propaganda detection shared task 2019. To address the tasks of sentence (SLC) and fragment level (FLC) propaganda detection, we explore different neural architectures (e.g., CNN, LSTM-CRF and BERT) and extract linguistic (e.g., part-of-speech, named entity, readability, sentiment, emotion, etc.), lay… ▽ More This paper describes our system (MIC-CIS) details and results of participation in the fine-grained propaganda detection shared task 2019. To address the tasks of sentence (SLC) and fragment level (FLC) propaganda detection, we explore different neural architectures (e.g., CNN, LSTM-CRF and BERT) and extract linguistic (e.g., part-of-speech, named entity, readability, sentiment, emotion, etc.), layout and topical features. Specifically, we have designed multi-granularity and multi-tasking neural architectures to jointly perform both the sentence and fragment level propaganda detection. Additionally, we investigate different ensemble schemes such as majority-voting, relax-voting, etc. to boost overall system performance. Compared to the other participating systems, our submissions are ranked 3rd and 4th in FLC and SLC tasks, respectively. △ Less

Submitted 13 September, 2019; originally announced September 2019.

Comments: EMNLP2019: Fine-grained propaganda detection shared task at NLP4IF workshop (EMNLP2019)

arXiv:1907.02423 [pdf, other]

Morphological Word Embeddings

Authors: Ryan Cotterell, Hinrich Schütze

Abstract: Linguistic similarity is multi-faceted. For instance, two words may be similar with respect to semantics, syntax, or morphology inter alia. Continuous word-embeddings have been shown to capture most of these shades of similarity to some degree. This work considers guiding word-embeddings with morphologically annotated data, a form of semi-supervised learning, encouraging the vectors to encode a wo… ▽ More Linguistic similarity is multi-faceted. For instance, two words may be similar with respect to semantics, syntax, or morphology inter alia. Continuous word-embeddings have been shown to capture most of these shades of similarity to some degree. This work considers guiding word-embeddings with morphologically annotated data, a form of semi-supervised learning, encouraging the vectors to encode a word's morphology, i.e., words close in the embedded space share morphological features. We extend the log-bilinear model to this end and show that indeed our learned embeddings achieve this, using German as a case study. △ Less

Submitted 4 July, 2019; originally announced July 2019.

Comments: Published at NAACL 2015

Showing 101–150 of 220 results for author: Schütze, H