Showing 1–50 of 218 results for author: Schütze, H

Searching in archive cs.
  1. arXiv:2407.00436

    cs.CL

    A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

    Authors: Peiqin Lin, André F. T. Martins, Hinrich Schütze

    Abstract: Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate…

    Submitted 29 June, 2024; originally announced July 2024.

  2. arXiv:2406.19759

    cs.CL

    Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

    Authors: Orgest Xhelili, Yihong Liu, Hinrich Schütze

    Abstract: Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: preprint

  3. arXiv:2406.18708

    cs.LG cs.CL

    Learn it or Leave it: Module Composition and Pruning for Continual Learning

    Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze

    Abstract: In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facil…

    Submitted 26 June, 2024; originally announced June 2024.

  4. arXiv:2406.17764

    cs.CL cs.AI

    BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning

    Authors: Ercong Nie, Bo Shao, Zifeng Ding, Mingyang Wang, Helmut Schmid, Hinrich Schütze

    Abstract: Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learn…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 12 pages, 4 figures

  5. arXiv:2406.12443

    cs.RO

    Robustness Testing of Multi-Modal Models in Varied Home Environments for Assistive Robots

    Authors: Lea Hirlimann, Shengqiang Zhang, Hinrich Schütze, Philipp Wicke

    Abstract: The development of assistive robotic agents to support household tasks is advancing, yet the underlying models often operate in virtual settings that do not reflect real-world complexity. For assistive care robots to be effective in diverse environments, their models must be robust and integrate multiple modalities. Consider a caretaker needing assistance in a dimly lit room or navigating around a…

    Submitted 19 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: Geriatronics Summit 2024, July 09 - 10, Garmisch-Partenkirchen Congress Center

  6. arXiv:2406.09881

    cs.CL

    A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation

    Authors: Yongkang Liu, Ercong Nie, Shi Feng, Zheng Hua, Zifeng Ding, Daling Wang, Yifei Zhang, Hinrich Schütze

    Abstract: Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD²G. The AMD…

    Submitted 28 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 17 pages, ECML-PKDD

    Journal ref: 2024 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

  7. arXiv:2406.06263

    cs.CL

    MaskLID: Code-Switching Language Identification through Iterative Masking

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: ACL 2024

  8. arXiv:2405.18308

    cs.CL

    Joint Lemmatization and Morphological Tagging with LEMMING

    Authors: Thomas Müller, Ryan Cotterell, Alexander Fraser, Hinrich Schütze

    Abstract: We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czech…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: EMNLP 2015; Honorable Mention for Best Short Paper

  9. arXiv:2405.09913

    cs.CL

    TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data

    Authors: Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schütze

    Abstract: Transliterating related languages that use different scripts into a common script shows effectiveness in improving crosslingual transfer in downstream tasks. However, this methodology often makes pretraining a model from scratch unavoidable, as transliteration brings about new subwords not covered in existing multilingual pretrained language models (mPLMs). This is not desired because it takes a l…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: preprint

  10. arXiv:2405.05116

    cs.CL

    XAMPLER: Learning to Retrieve Cross-Lingual In-Context Examples

    Authors: Peiqin Lin, André F. T. Martins, Hinrich Schütze

    Abstract: Recent studies indicate that leveraging off-the-shelf or fine-tuned retrievers, capable of retrieving relevant in-context examples tailored to the input query, enhances few-shot in-context learning of English. However, adapting these methods to other languages, especially low-resource ones, poses challenges due to the scarcity of cross-lingual retrievers and annotated data. Thus, we introduce XAMP…

    Submitted 29 June, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  11. arXiv:2404.11672

    cs.CL

    MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory

    Authors: Ali Modarressi, Abdullatif Köksal, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze

    Abstract: While current large language models (LLMs) demonstrate some capabilities in knowledge-intensive tasks, they are limited by relying on their parameters as an implicit storage mechanism. As a result, they struggle with infrequent knowledge and temporal degradation. In addition, the uninterpretable nature of parametric memorization makes it challenging to understand and prevent hallucination. Paramet…

    Submitted 17 April, 2024; originally announced April 2024.

  12. Labeled Morphological Segmentation with Semi-Markov Models

    Authors: Ryan Cotterell, Thomas Müller, Alexander Fraser, Hinrich Schütze

    Abstract: We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks. From an annotation standpoint, we additionally introduce a new hierarchy of morphotactic tagsets. Finally, we develop CHIPMUNK, a discriminative morphological segmentation system that, contrary to previous work, explicitly models morphotactics. We show that CHIPMUNK…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: CoNLL 2015

  13. arXiv:2404.00790

    cs.LG cs.CL

    Rehearsal-Free Modular and Compositional Continual Learning for Language Models

    Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze

    Abstract: Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning doe…

    Submitted 31 March, 2024; originally announced April 2024.

  14. arXiv:2403.17856

    cs.CL

    Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs

    Authors: David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, Hinrich Schütze, Leonie Weissweiler

    Abstract: Lexical-syntactic flexibility, in the form of conversion (or zero-derivation) is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to w…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  15. arXiv:2403.17760

    cs.CL

    Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons

    Authors: Shijia Zhou, Leonie Weissweiler, Taiqi He, Hinrich Schütze, David R. Mortensen, Lori Levin

    Abstract: In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort…

    Submitted 29 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  16. arXiv:2403.17748

    cs.CL

    UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies

    Authors: Leonie Weissweiler, Nina Böbel, Kirian Guiller, Santiago Herrera, Wesley Scivetti, Arthur Lorenzi, Nurit Melnik, Archna Bhatia, Hinrich Schütze, Lori Levin, Amir Zeldes, Joakim Nivre, William Croft, Nathan Schneider

    Abstract: The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements -- for example, interrogative sentences with special markers and/or word orders -- are not labele…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  17. arXiv:2403.10293

    cs.CL

    MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

    Authors: Verena Blaschke, Barbara Kovačić, Siyao Peng, Hinrich Schütze, Barbara Plank

    Abstract: Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in 'within-language breadth': most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  18. arXiv:2403.06965

    cs.CL

    Hybrid Human-LLM Corpus Construction and LLM Evaluation for Rare Linguistic Phenomena

    Authors: Leonie Weissweiler, Abdullatif Köksal, Hinrich Schütze

    Abstract: Argument Structure Constructions (ASCs) are one of the most well-studied construction groups, providing a unique opportunity to demonstrate the usefulness of Construction Grammar (CxG). For example, the caused-motion construction (CMC, "She sneezed the foam off her cappuccino") demonstrates that constructions must carry meaning, otherwise the fact that "sneeze" in this context causes movement…

    Submitted 11 March, 2024; originally announced March 2024.

  19. arXiv:2402.18397

    cs.CL

    Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models

    Authors: Ercong Nie, Shuzhou Yuan, Bolei Ma, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

    Abstract: Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence la…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: 18 pages, 7 figures

  20. arXiv:2402.16786

    cs.CL cs.AI

    Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

    Authors: Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy

    Abstract: Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificial…

    Submitted 5 June, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024 (Main Conference)

  21. arXiv:2402.11968

    cs.CL

    What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects

    Authors: Verena Blaschke, Christoph Purschke, Hinrich Schütze, Barbara Plank

    Abstract: Natural language processing (NLP) has largely focused on modelling standardized languages. More recently, attention has increasingly shifted to local, non-standardized languages and dialects. However, the relevant speaker populations' needs and wishes with respect to NLP tools are largely unknown. In this paper, we focus on dialects and regional languages related to German -- a group of varieties…

    Submitted 7 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 main

  22. arXiv:2402.11709

    cs.CL cs.AI

    GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network

    Authors: Shuzhou Yuan, Ercong Nie, Michael Färber, Helmut Schmid, Hinrich Schütze

    Abstract: Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are used. However, fine-tuning still remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducin…

    Submitted 7 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Findings

  23. arXiv:2401.16589

    cs.CL

    ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks

    Authors: Bolei Ma, Ercong Nie, Shuzhou Yuan, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

    Abstract: Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt De…

    Submitted 13 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: EACL 2024

  24. arXiv:2401.15207

    cs.LG cs.CL

    HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy

    Authors: Yongkang Liu, Yiqun Zhang, Qian Li, Tong Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

    Abstract: Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero…

    Submitted 17 June, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: under progress

  25. arXiv:2401.13303

    cs.CL

    MaLA-500: Massive Language Adaptation of Large Language Models

    Authors: Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann, André F. T. Martins, Hinrich Schütze

    Abstract: Large language models (LLMs) have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we em…

    Submitted 3 April, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

  26. arXiv:2401.06620

    cs.CL

    TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models

    Authors: Yihong Liu, Chunlan Ma, Haotian Ye, Hinrich Schütze

    Abstract: The world's more than 7000 languages are written in at least 293 scripts. Due to various reasons, many closely related languages use different scripts, which poses a difficulty for multilingual pretrained language models (mPLMs) in learning crosslingual knowledge through lexical overlap. As a consequence, mPLMs are faced with a script barrier: representations from different scripts are located in…

    Submitted 23 May, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

    Comments: ACL 2024

  27. arXiv:2401.04821

    cs.CL cs.AI

    MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer

    Authors: Haotian Ye, Yihong Liu, Chunlan Ma, Hinrich Schütze

    Abstract: Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available to high-resource languages. On the contrary, static word embeddings are easier to train in terms of computing resources and the amount of data required. In this…

    Submitted 17 May, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

  28. arXiv:2311.12489

    cs.CL

    Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages

    Authors: Viktor Hangya, Silvia Severini, Radoslav Ralev, Alexander Fraser, Hinrich Schütze

    Abstract: Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good cross-lingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted at the MRL 2023 workshop

  29. arXiv:2311.08849

    cs.CL

    OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

    Authors: Yihong Liu, Peiqin Lin, Mingyang Wang, Hinrich Schütze

    Abstract: Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To ad…

    Submitted 25 March, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Findings

  30. GlotLID: Language Identification for Low-Resource Languages

    Authors: Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze

    Abstract: Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide covera…

    Submitted 4 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  31. arXiv:2310.15269

    cs.LG cs.CL

    GradSim: Gradient-Based Language Grouping for Effective Multilingual Training

    Authors: Mingyang Wang, Heike Adel, Lukas Lange, Jannik Strötgen, Hinrich Schütze

    Abstract: Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteris…

    Submitted 23 October, 2023; originally announced October 2023.

  32. arXiv:2310.15113

    cs.CL

    Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

    Authors: Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, David R. Mortensen

    Abstract: Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (i…

    Submitted 26 October, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  33. arXiv:2310.12020

    cs.RO cs.CL cs.CV

    LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation

    Authors: Shengqiang Zhang, Philipp Wicke, Lütfi Kerem Şenel, Luis Figueredo, Abdeldjallil Naceri, Sami Haddadin, Barbara Plank, Hinrich Schütze

    Abstract: The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. Particularly, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned…

    Submitted 23 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: 6 pages, 4 figures. The video and code of LoHoRavens are available at https://cisnlp.github.io/lohoravens-webpage/

  34. arXiv:2310.05069

    cs.CL

    Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration

    Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze

    Abstract: Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model's bias…

    Submitted 19 October, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted to Findings of EMNLP 2023

  35. arXiv:2309.13320

    cs.CL

    GlotScript: A Resource and Tool for Low Resource Writing System Identification

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it ret…

    Submitted 27 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

    Comments: LREC-COLING 2024

  36. arXiv:2308.04645

    cs.CL

    Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach

    Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze

    Abstract: Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources.…

    Submitted 29 August, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted to ALP 2023

  37. arXiv:2307.07880

    cs.CL

    Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding

    Authors: Bolei Ma, Ercong Nie, Helmut Schmid, Hinrich Schütze

    Abstract: Multilingual pretrained language models (MPLMs) have demonstrated substantial performance improvements in zero-shot cross-lingual transfer across various natural language understanding tasks by finetuning MPLMs on task-specific labelled data of a source language (e.g. English) and evaluating on a wide range of target languages. Recent studies show that prompt-based finetuning surpasses regular fin…

    Submitted 15 July, 2023; originally announced July 2023.

    Comments: KONVENS 2023

  38. arXiv:2306.14830

    cs.RO

    Towards Language-Based Modulation of Assistive Robots through Multimodal Models

    Authors: Philipp Wicke, Lütfi Kerem Şenel, Shengqiang Zhang, Luis Figueredo, Abdeldjallil Naceri, Sami Haddadin, Hinrich Schütze

    Abstract: In the field of Geriatronics, enabling effective and transparent communication between humans and robots is crucial for enhancing the acceptance and performance of assistive robots. Our early-stage research project investigates the potential of language-based modulation as a means to improve human-robot interaction. We propose to explore real-time modulation during task execution, leveraging langu…

    Submitted 27 June, 2023; v1 submitted 26 June, 2023; originally announced June 2023.

    Comments: GERIATRONICS SUMMIT 2023

  39. arXiv:2306.09752

    cs.CL cs.AI cs.CY cs.LG

    Politeness Stereotypes and Attack Vectors: Gender Stereotypes in Japanese and Korean Language Models

    Authors: Victor Steinborn, Antonis Maronikolakis, Hinrich Schütze

    Abstract: In efforts to keep up with the rapid progress and use of large language models, gender bias research is becoming more prevalent in NLP. Non-English bias research, however, is still in its infancy with most work focusing on English. In our work, we study how grammatical gender bias relating to politeness levels manifests in Japanese and Korean language models. Linguistic studies in these languages…

    Submitted 16 June, 2023; originally announced June 2023.

  40. arXiv:2305.17182

    cs.CL cs.AI

    On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss

    Authors: Yihong Liu, Alexandra Chronopoulou, Hinrich Schütze, Alexander Fraser

    Abstract: Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT).…

    Submitted 4 June, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: IWSLT 2023

  41. arXiv:2305.15032

    cs.CL cs.AI cs.LG

    How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

    Authors: Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank

    Abstract: Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show t…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  42. arXiv:2305.14658

    cs.CL

    Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response

    Authors: Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

    Abstract: LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples with different…

    Submitted 5 May, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: preprint

  43. arXiv:2305.14322  [pdf, other]

    cs.CL

    RET-LLM: Towards a General Read-Write Memory for Large Language Models

    Authors: Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze

    Abstract: Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM a novel framework that equips LLMs with a general wri…

    Submitted 23 May, 2023; originally announced May 2023.

  44. arXiv:2305.14250  [pdf, other]

    cs.CL cs.AI

    Language Models with Rationality

    Authors: Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schuetze, Peter Clark

    Abstract: While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent "beliefs". This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that…

    Submitted 29 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

  45. arXiv:2305.13684  [pdf, other]

    cs.CL

    mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models

    Authors: Peiqin Lin, Chengzhi Hu, Zheyu Zhang, André F. T. Martins, Hinrich Schütze

    Abstract: Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we…

    Submitted 29 January, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EACL 2024 Findings

  46. arXiv:2305.13401  [pdf, other]

    cs.CL

    A study of conceptual language similarity: comparison and evaluation

    Authors: Haotian Ye, Yihong Liu, Hinrich Schütze

    Abstract: An interesting line of research in natural language processing (NLP) aims to incorporate linguistic typology to bridge linguistic diversity and assist the research of low-resource languages. While most works construct linguistic similarity measures based on lexical or typological features, such as word order and verbal inflection, recent work has introduced a novel approach to defining language si…

    Submitted 22 May, 2023; originally announced May 2023.

  47. arXiv:2305.13302  [pdf, other]

    cs.CL

    Language-Agnostic Bias Detection in Language Models with Bias Probing

    Authors: Abdullatif Köksal, Omer Faruk Yalcin, Ahmet Akbiyik, M. Tahir Kilavuz, Anna Korhonen, Hinrich Schütze

    Abstract: Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nation…

    Submitted 20 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Findings

  48. arXiv:2305.12818  [pdf, other]

    cs.CL cs.AI

    Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs

    Authors: Yihong Liu, Haotian Ye, Leonie Weissweiler, Renhao Pei, Hinrich Schütze

    Abstract: In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then pr…

    Submitted 19 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 Findings

  49. Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    Authors: Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

    Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages an…

    Submitted 26 May, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  50. arXiv:2305.08487  [pdf, other]

    cs.CL

    Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages

    Authors: Chunlan Ma, Ayyoob ImaniGooghari, Haotian Ye, Renhao Pei, Ehsaneddin Asgari, Hinrich Schütze

    Abstract: While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's over 7000 languages are still neglected. One reason for this is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompas…

    Submitted 4 June, 2024; v1 submitted 15 May, 2023; originally announced May 2023.