
Showing 51–100 of 220 results for author: Schütze, H

  1. arXiv:2305.08487

    cs.CL

    Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

    Authors: Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schütze

    Abstract: While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompas…

    Submitted 4 June, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

  2. arXiv:2305.08475

    cs.CL

    A Crosslingual Investigation of Conceptualization in 1335 Languages

    Authors: Yihong Liu, Haotian Ye, Leonie Weissweiler, Philipp Wicke, Renhao Pei, Robert Zangenfeind, Hinrich Schütze

    Abstract: Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for `belly' and `womb'. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language…

    Submitted 26 May, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  3. NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

    Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze

    Abstract: This paper describes our system developed for the SemEval-2023 Task 12 "Sentiment Analysis for Low-resource African Languages using Twitter Dataset". Sentiment analysis is one of the most widely studied applications in natural language processing. However, most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource langu…

    Submitted 28 April, 2023; originally announced May 2023.

  4. arXiv:2304.10158

    cs.CL

    Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages

    Authors: Verena Blaschke, Hinrich Schütze, Barbara Plank

    Abstract: One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite…

    Submitted 20 April, 2023; originally announced April 2023.

    Comments: VarDial 2023

  5. arXiv:2304.09805

    cs.CL

    A Survey of Corpora for Germanic Low-Resource Languages and Dialects

    Authors: Verena Blaschke, Hinrich Schütze, Barbara Plank

    Abstract: Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: NoDaLiDa 2023

  6. arXiv:2304.08460

    cs.CL cs.AI cs.LG

    LongForm: Effective Instruction Tuning with Reverse Instructions

    Authors: Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schütze

    Abstract: Instruction tuning enables language models to more effectively generalize and better follow user intent. However, obtaining instruction data is costly and challenging. Prior work employs methods such as expensive human annotation, crowd-sourced datasets with alignment issues, and generating noisy examples via LLMs. We introduce the LongForm-C dataset, which is created by reverse instructions. We g…

    Submitted 14 February, 2024; v1 submitted 17 April, 2023; originally announced April 2023.

    Comments: This version extends the evaluation with new metrics and NLU tasks

  7. arXiv:2304.01890

    cs.CL cs.AI cs.LG

    Sociocultural knowledge is needed for selection of shots in hate speech detection tasks

    Authors: Antonis Maronikolakis, Abdullatif Köksal, Hinrich Schütze

    Abstract: We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for the countries of Brazil, Germany, India and Kenya, to aid training and interpretability of models. We demonstrate how our lexicon can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target words when making predictions. Further, we propose a method to aid sho…

    Submitted 17 May, 2023; v1 submitted 4 April, 2023; originally announced April 2023.

  8. arXiv:2303.09236

    cs.SE

    GIRT-Data: Sampling GitHub Issue Report Templates

    Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Abbas Heydarnoori, Hinrich Schütze

    Abstract: GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs, requesting features, and collaborating on ideas. In the initial versions of issue reports, there was no standard way of using them. As a result, the quality of issue r…

    Submitted 21 March, 2023; v1 submitted 16 March, 2023; originally announced March 2023.

    Comments: Accepted to be published at the 20th IEEE/ACM International Conference on Mining Software Repositories (MSR 2023)

  9. arXiv:2303.04496

    cs.CL cs.AI cs.HC

    MenuCraft: Interactive Menu System Design with Large Language Models

    Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Abbas Heydarnoori, Hinrich Schütze

    Abstract: Menu system design is a challenging task involving many design options and various human factors. For example, one crucial factor that designers need to consider is the semantic and systematic relation of menu commands. However, capturing these relations can be challenging due to limited available resources. With the advancement of neural language models, large language models can utilize their va…

    Submitted 23 July, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

  10. arXiv:2302.02178

    cs.CL

    Construction Grammar Provides Unique Insight into Neural Language Models

    Authors: Leonie Weissweiler, Taiqi He, Naoki Otani, David R. Mortensen, Lori Levin, Hinrich Schütze

    Abstract: Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mi…

    Submitted 4 February, 2023; originally announced February 2023.

    Comments: GURT 2023

  11. arXiv:2212.09651

    cs.CL

    Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

    Authors: Ercong Nie, Sheng Liang, Helmut Schmid, Hinrich Schütze

    Abstract: Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource langu…

    Submitted 10 July, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted to Findings of ACL 2023

  12. arXiv:2212.09086

    cs.CL

    PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism

    Authors: Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

    Abstract: We investigate response generation for multi-turn dialogue in generative-based chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the sequences, which makes models unable to capture the subtle variability observed in different dialogues and cannot distinguish the differences between dialogues that are similar in composit…

    Submitted 16 May, 2023; v1 submitted 18 December, 2022; originally announced December 2022.

    Comments: ACL2023 main conference

  13. arXiv:2212.07547

    cs.CL cs.AI cs.SI

    Unsupervised Detection of Contextualized Embedding Bias with Application to Ideology

    Authors: Valentin Hofmann, Janet B. Pierrehumbert, Hinrich Schütze

    Abstract: We propose a fully unsupervised method to detect bias in contextualized embeddings. The method leverages the assortative information latently encoded by social networks and combines orthogonality regularization, structured sparsity learning, and graph neural networks to find the embedding subspace capturing this information. As a concrete example, we focus on the phenomenon of ideological bias: we…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: ICML 2022

  14. arXiv:2211.08358

    cs.CL

    MEAL: Stable and Active Learning for Few-Shot Prompting

    Authors: Abdullatif Köksal, Timo Schick, Hinrich Schütze

    Abstract: Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (data selection) and across different finetuning runs (run variability). This is problematic not only because it impedes the fair comparison of different approaches, but…

    Submitted 20 November, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: EMNLP 2023 Findings

  15. arXiv:2210.13985

    cs.CL cs.CY

    This joke is [MASK]: Recognizing Humor and Offense with Prompting

    Authors: Junze Li, Mengjie Zhao, Yubo Xie, Antonis Maronikolakis, Pearl Pu, Hinrich Schütze

    Abstract: Humor is a magnetic component in everyday human interactions and communications. Computationally modeling humor enables NLP systems to entertain and engage with users. We investigate the effectiveness of prompting, a new transfer learning paradigm for NLP, for humor recognition. We show that prompting performs similarly to finetuning when numerous annotations are available, but gives stellar perfo…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: Transfer Learning for Natural Language Processing Workshop at NeurIPS 2022

  16. arXiv:2210.13181

    cs.CL

    The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative

    Authors: Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, Hinrich Schütze

    Abstract: Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntact…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  17. arXiv:2210.09840

    cs.CL

    Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

    Authors: Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

    Abstract: Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-reso…

    Submitted 31 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  18. arXiv:2210.06207

    cs.CL

    SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment

    Authors: Abdullatif Köksal, Silvia Severini, Hinrich Schütze

    Abstract: Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance…

    Submitted 27 March, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

  19. arXiv:2210.06101

    cs.CL cs.AI

    Federated Continual Learning for Text Classification via Selective Inter-client Transfer

    Authors: Yatin Chaudhary, Pranav Rai, Matthias Schubert, Hinrich Schütze, Pankaj Gupta

    Abstract: In this work, we combine the two paradigms: Federated Learning (FL) and Continual Learning (CL) for text classification task in cloud-edge continuum. The objective of Federated Continual Learning (FCL) is to improve deep learning models over life time at each client by (relevant and efficient) knowledge transfer without sharing data. Here, we address challenges in minimizing inter-client interfere…

    Submitted 12 February, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: EMNLP2022 (Findings): 11 pages, 5 figures, 4 tables

  20. arXiv:2209.12495

    cs.CL

    Modeling Content-Emotion Duality via Disentanglement for Empathetic Conversation

    Authors: Peiqin Lin, Jiashuo Wang, Hinrich Schütze, Wenjie Li

    Abstract: The task of empathetic response generation aims to understand what feelings a speaker expresses on his/her experiences and then reply to the speaker appropriately. To solve the task, it is essential to model the content-emotion duality of a dialogue, which is composed of the content view (i.e., what personal experiences are described) and the emotion view (i.e., the feelings of the speaker on thes…

    Submitted 26 September, 2022; originally announced September 2022.

  21. arXiv:2207.14251

    cs.CL

    Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions

    Authors: Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, Yoav Goldberg

    Abstract: Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain exp…

    Submitted 24 March, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: We received a criticism regarding the validity of the causal formulation in this paper. We will address them in an upcoming version

  22. arXiv:2206.04615

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  23. arXiv:2205.15713

    cs.CL

    Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings

    Authors: Silvia Severini, Viktor Hangya, Masoud Jalili Sabet, Alexander Fraser, Hinrich Schütze

    Abstract: Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, w…

    Submitted 31 May, 2022; originally announced May 2022.

    Comments: BUCC@LREC 2022

  24. arXiv:2205.06621

    cs.CL cs.AI cs.LG

    Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes

    Authors: Antonis Maronikolakis, Philip Baader, Hinrich Schütze

    Abstract: To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets…

    Submitted 18 May, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

    Comments: Accepted at "4th Workshop on Gender Bias in Natural Language Processing", NAACL 2022

  25. arXiv:2204.12225

    cs.CL

    Flow-Adapter Architecture for Unsupervised Machine Translation

    Authors: Yihong Liu, Haris Jabbar, Hinrich Schütze

    Abstract: In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: ACL 2022

  26. arXiv:2203.16926

    cs.CL

    Domain Adaptation for Sparse-Data Settings: What Do We Gain by Not Using Bert?

    Authors: Marina Sedinkina, Martin Schmitt, Hinrich Schütze

    Abstract: The practical success of much of NLP depends on the availability of training data. However, in real-world scenarios, training data is often scarce, not least because many application domains are restricted and specific. In this work, we compare different methods to handle this problem and provide guidelines for building NLP applications when there is only a small amount of labeled training data av…

    Submitted 31 March, 2022; originally announced March 2022.

  27. arXiv:2203.11764

    cs.CL cs.AI

    Listening to Affected Communities to Define Extreme Speech: Dataset and Experiments

    Authors: Antonis Maronikolakis, Axel Wisiorek, Leah Nann, Haris Jabbar, Sahana Udupa, Hinrich Schuetze

    Abstract: Building on current work on multilingual hate speech (e.g., Ousidhoum et al. (2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya. The key novelty is that we directly involve the affected communities in collecting and annotating the data - as opposed to giving co…

    Submitted 22 March, 2022; originally announced March 2022.

    Comments: Accepted to ACL 2022 Findings

  28. arXiv:2203.10010

    cs.CL

    CaMEL: Case Marker Extraction without Labels

    Authors: Leonie Weissweiler, Valentin Hofmann, Masoud Jalili Sabet, Hinrich Schütze

    Abstract: We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a s…

    Submitted 28 March, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  29. arXiv:2203.09590

    cs.CL cs.LG

    ECOLA: Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations

    Authors: Zhen Han, Ruotong Liao, Jindong Gu, Yao Zhang, Zifeng Ding, Yujia Gu, Heinz Köppl, Hinrich Schütze, Volker Tresp

    Abstract: Since conventional knowledge embedding models cannot take full advantage of the abundant textual information, there have been extensive research efforts in enhancing knowledge embedding using texts. However, existing enhancement approaches cannot apply to temporal knowledge graphs (tKGs), which contain time-dependent event knowledge with complex temporal dynamics. Specifically, existing enhancemen…

    Submitted 4 May, 2023; v1 submitted 17 March, 2022; originally announced March 2022.

    Comments: accepted to Findings of the ACL 2023

  30. arXiv:2203.08654

    cs.CL

    Graph Neural Networks for Multiparallel Word Alignment

    Authors: Ayyoob Imani, Lütfi Kerem Şenel, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

    Abstract: After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pa…

    Submitted 10 August, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Report number: ACL 2022 Findings

  31. arXiv:2203.08565

    cs.CL

    Geographic Adaptation of Pretrained Language Models

    Authors: Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

    Abstract: While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce ge…

    Submitted 28 January, 2024; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: TACL 2024 (pre-MIT Press publication version)

  32. arXiv:2203.08257

    cs.CL

    Differentiable Multi-Agent Actor-Critic for Multi-Step Radiology Report Summarization

    Authors: Sanjeev Kumar Karn, Ning Liu, Hinrich Schuetze, Oladimeji Farri

    Abstract: The IMPRESSIONS section of a radiology report about an imaging study is a summary of the radiologist's reasoning and conclusions, and it also aids the referring physician in confirming or excluding certain diagnoses. A cascade of tasks are required to automatically generate an abstractive summary of the typical information-rich radiology report. These tasks include acquisition of salient content f…

    Submitted 29 April, 2022; v1 submitted 15 March, 2022; originally announced March 2022.

    Comments: Accepted at 60th Annual Meeting of the Association for Computational Linguistics 2022 Main Conference

    Journal ref: 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 2022

  33. arXiv:2203.08055

    cs.CL

    Modular and Parameter-Efficient Multimodal Fusion with Prompting

    Authors: Sheng Liang, Mengjie Zhao, Hinrich Schütze

    Abstract: Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than finetuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings. We…

    Submitted 15 March, 2022; originally announced March 2022.

    Comments: Accepted to Findings of ACL 2022

  34. arXiv:2203.06228

    cs.CL cs.AI

    CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

    Authors: Lütfi Kerem Senel, Timo Schick, Hinrich Schütze

    Abstract: Pretrained language models (PLMs) have achieved superhuman performance on many benchmarks, creating a need for harder tasks. We introduce CoDA21 (Context Definition Alignment), a challenging benchmark that measures natural language understanding (NLU) capabilities of PLMs: Given a definition and a context each for k words, but not the words themselves, the task is to align the k definitions with t…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: To appear in ACL 2022, 5 pages, 2 figures

  35. arXiv:2202.06133

    cs.CL

    Semantic-Oriented Unlabeled Priming for Large-Scale Language Models

    Authors: Yanchen Liu, Timo Schick, Hinrich Schütze

    Abstract: Due to the high costs associated with finetuning large language models, various recent works propose to adapt them to specific tasks without any parameter updates through in-context learning. Unfortunately, for in-context learning there is currently no way to leverage unlabeled data, which is often much easier to obtain in large quantities than labeled examples. In this work, we therefore investig…

    Submitted 12 February, 2022; originally announced February 2022.

  36. arXiv:2201.12219

    cs.CL

    Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

    Authors: Silvia Severini, Ayyoob Imani, Philipp Dufter, Hinrich Schütze

    Abstract: Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE re…

    Submitted 29 April, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: LREC 2022

  37. arXiv:2112.07522

    cs.CL

    LMTurk: Few-Shot Learners as Crowdsourcing Workers in a Language-Model-as-a-Service Framework

    Authors: Mengjie Zhao, Fei Mi, Yasheng Wang, Minglei Li, Xin Jiang, Qun Liu, Hinrich Schütze

    Abstract: Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these fe…

    Submitted 2 May, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Findings of ACL: NAACL 2022

  38. arXiv:2111.13440

    cs.CL

    True Few-Shot Learning with Prompts -- A Real-World Perspective

    Authors: Timo Schick, Hinrich Schütze

    Abstract: Prompt-based approaches are strong at few-shot learning. However, Perez et al. (2021) have recently cast doubt on their performance because they had difficulty getting good results in a "true" few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of PET, a method that combines textual instructions with example-based finet…

    Submitted 26 November, 2021; originally announced November 2021.

  39. arXiv:2109.14723

    cs.CL

    BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief

    Authors: Nora Kassner, Oyvind Tafjord, Hinrich Schütze, Peter Clark

    Abstract: Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our appro…

    Submitted 29 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021 Camera Ready. arXiv admin note: substantial text overlap with arXiv:2104.08401

  40. arXiv:2109.13611

    cs.CL

    Active Learning for Argument Mining: A Practical Approach

    Authors: Nikolai Solmsdorf, Dietrich Trautmann, Hinrich Schütze

    Abstract: Despite considerable recent progress, the creation of well-balanced and diverse resources remains a time-consuming and costly challenge in Argument Mining. Active Learning reduces the amount of data necessary for the training of machine learning models by querying the most informative samples for annotation and therefore is a promising method for resource creation. In a large scale comparison of s…

    Submitted 28 September, 2021; originally announced September 2021.

  41. arXiv:2109.11398

    cs.CV

    Scene Graph Generation for Better Image Captioning?

    Authors: Maximilian Mozes, Martin Schmitt, Vladimir Golkov, Hinrich Schütze, Daniel Cremers

    Abstract: We investigate the incorporation of visual relationships into the task of supervised image caption generation by proposing a model that leverages detected objects and auto-generated visual relationships to describe images in natural language. To do so, we first generate a scene graph from raw image pixels by identifying individual objects and visual relationships between them. This scene graph the…

    Submitted 23 September, 2021; originally announced September 2021.

    Comments: Technical report. This work was done and the paper was written in 2019

  42. arXiv:2109.09700  [pdf, other]

    cs.CL cs.AI

    BERT Cannot Align Characters

    Authors: Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

    Abstract: In previous work, it has been shown that BERT can adequately align cross-lingual sentences on the word level. Here we investigate whether BERT can also operate as a char-level aligner. The languages examined are English, Fake-English, German and Greek. We show that the closer two languages are, the better BERT can align them on the character level. BERT indeed works well in English to Fake-English…

    Submitted 20 September, 2021; originally announced September 2021.

    Comments: Second Workshop on Insights from Negative Results, EMNLP 2021

  43. arXiv:2109.08040  [pdf, other]

    cs.CL

    Locating Language-Specific Information in Contextualized Embeddings

    Authors: Sheng Liang, Philipp Dufter, Hinrich Schütze

    Abstract: Multilingual pretrained language models (MPLMs) exhibit multilinguality and are well suited for transfer across languages. Most MPLMs are trained in an unsupervised fashion and the relationship between their objective and multilinguality is unclear. More specifically, the question whether MPLM representations are language-agnostic or they simply interleave well with learned task prediction heads a…

    Submitted 16 September, 2021; originally announced September 2021.

  44. arXiv:2109.06283  [pdf, other]

    cs.CL

    Graph Algorithms for Multiparallel Word Alignment

    Authors: Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze

    Abstract: With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilin…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  45. arXiv:2109.05772  [pdf, other]

    cs.CL

    Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages

    Authors: Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

    Abstract: The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflec…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP 2021 Findings

  46. arXiv:2109.03695  [pdf, other]

    cs.CL cs.AI

    Continuous Entailment Patterns for Lexical Inference in Context

    Authors: Martin Schmitt, Hinrich Schütze

    Abstract: Combining a pretrained language model (PLM) with textual patterns has been shown to help in both zero- and few-shot settings. For zero-shot performance, it makes sense to design patterns that closely resemble the text seen during self-supervised pretraining because the model has never seen anything else. Supervised training allows for more flexibility. If we allow for tokens outside the PLM's voca…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted as a short paper at EMNLP 2021. Code available at https://github.com/mnschmit/conan

  47. arXiv:2109.03630  [pdf, other]

    cs.CL

    Discrete and Soft Prompting for Multilingual Models

    Authors: Mengjie Zhao, Hinrich Schütze

    Abstract: It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting perform better than finetuning in multilingual cases: Crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  48. arXiv:2109.02050  [pdf, other]

    cs.SE

    Semi-Automated Labeling of Requirement Datasets for Relation Extraction

    Authors: Jeremias Bohn, Jannik Fischbach, Martin Schmitt, Hinrich Schütze, Andreas Vogelsang

    Abstract: Creating datasets manually by human annotators is a laborious task that can lead to biased and inhomogeneous labels. We propose a flexible, semi-automatic framework for labeling data for relation extraction. Furthermore, we provide a dataset of preprocessed sentences from the requirements engineering domain, including a set of automatically created as well as hand-crafted labels. In our case study…

    Submitted 5 September, 2021; originally announced September 2021.

  49. arXiv:2107.06632  [pdf, other]

    cs.CL

    ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

    Authors: Ayyoob Imani, Masoud Jalili Sabet, Philipp Dufter, Michael Cysouw, Hinrich Schütze

    Abstract: With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating reso…

    Submitted 15 July, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

    Comments: The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

  50. arXiv:2107.00927  [pdf, other]

    cs.CL cs.LG

    Data Centric Domain Adaptation for Historical Text with OCR Errors

    Authors: Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth, Hinrich Schütze

    Abstract: We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general appro…

    Submitted 2 July, 2021; originally announced July 2021.

    Comments: 14 pages, 2 figures, 6 tables