Showing 1–32 of 32 results for author: van der Goot, R

Searching in archive cs.
  1. arXiv:2404.13760

    cs.CL

    How to Encode Domain Information in Relation Classification

    Authors: Elisa Bassignana, Viggo Unmack Gascou, Frida Nøhr Laustsen, Gustav Kristensen, Marie Haahr Petersen, Rob van der Goot, Barbara Plank

    Abstract: Current language models require a lot of training data to obtain high performance. For Relation Classification (RC), many datasets are domain-specific, so combining datasets to obtain better performance is non-trivial. We explore a multi-domain training setup for RC, and attempt to improve performance by encoding domain information. Our proposed models improve > 2 Macro-F1 against the baseline set…

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted at LREC-COLING 2024

  2. arXiv:2404.01785

    cs.CL

    Can Humans Identify Domains?

    Authors: Maria Barrett, Max Müller-Eberstein, Elisa Bassignana, Amalie Brogaard Pauli, Mike Zhang, Rob van der Goot

    Abstract: Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance. The concept itself is, however, loosely defined and, in practice, refers to any non-typological property, such as genre, topic, medium or style of a document. We investigate the core notion of domains via human proficiency in identifying related intrinsic…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted at LREC-COLING 2024

  3. arXiv:2403.08046

    cs.CL

    Big City Bias: Evaluating the Impact of Metropolitan Size on Computational Job Market Abilities of Language Models

    Authors: Charlie Campanella, Rob van der Goot

    Abstract: Large language models (LLMs) have emerged as a useful technology for job matching, for both candidates and employers. Job matching is often based on a particular geographic location, such as a city or region. However, LLMs have known biases, commonly derived from their training data. In this work, we aim to quantify the metropolitan size bias encoded within large language models, evaluating zero-s…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 5 pages, 3 figures, 2 tables, NLP4HR Workshop @ EACL 2024

    MSC Class: I.2.7

  4. arXiv:2402.05617

    cs.CL

    Deep Learning-based Computational Job Market Analysis: A Survey on Skill Extraction and Classification from Job Postings

    Authors: Elena Senger, Mike Zhang, Rob van der Goot, Barbara Plank

    Abstract: Recent years have brought significant advances to Natural Language Processing (NLP), which enabled fast progress in the field of computational job market analysis. Core tasks in this application domain are skill extraction and classification from job postings. Because of its quick growth and its interdisciplinary nature, there is no exhaustive assessment of this emerging field. This survey aims to…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: Published at NLP4HR 2024 (EACL Workshop)

  5. arXiv:2402.02864

    cs.CL cs.HC

    EEVEE: An Easy Annotation Tool for Natural Language Processing

    Authors: Axel Sorensen, Siyao Peng, Barbara Plank, Rob van der Goot

    Abstract: Annotation tools are the starting point for creating Natural Language Processing (NLP) datasets. There is a wide variety of tools available; setting up these tools is however a hindrance. We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use. It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or tas…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 6 pages; accepted to The Linguistic Annotation Workshop (LAW) at EACL 2024

  6. arXiv:2401.17979

    cs.CL

    Entity Linking in the Job Market Domain

    Authors: Mike Zhang, Rob van der Goot, Barbara Plank

    Abstract: In Natural Language Processing, entity linking (EL) has centered around Wikipedia, but yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Pr…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted at EACL 2024 Findings

  7. arXiv:2401.17092

    cs.CL

    NNOSE: Nearest Neighbor Occupational Skill Extraction

    Authors: Mike Zhang, Rob van der Goot, Min-Yen Kan, Barbara Plank

    Abstract: The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill datasets tasks -- combining and leveraging multiple datasets for skill extraction, to identify rare…

    Submitted 30 January, 2024; originally announced January 2024.

    Comments: Accepted at EACL 2024 Main

  8. arXiv:2310.16484

    cs.CL

    Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank, Ivan Titov

    Abstract: Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP), however there has been limited understanding regarding how and when during training various types of linguistic information emerge and interact. Leveraging a novel information theoretic probing suite, which enables direct comparisons of not just task performance, but their representational s…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023 (Findings)

  9. arXiv:2310.05442

    cs.CL

    Establishing Trustworthiness: Rethinking Tasks and Model Evaluation

    Authors: Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber, Barbara Plank

    Abstract: Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs) the community has w…

    Submitted 23 October, 2023; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023 (Main Conference), camera-ready

  10. arXiv:2305.20080

    cs.CL

    Findings of the VarDial Evaluation Campaign 2023

    Authors: Noëmi Aepli, Çağrı Çöltekin, Rob Van Der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

    Abstract: This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR),…

    Submitted 31 May, 2023; originally announced May 2023.

    Journal ref: In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 251-261, Dubrovnik, Croatia. Association for Computational Linguistics

  11. arXiv:2305.12092

    cs.CL

    ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain

    Authors: Mike Zhang, Rob van der Goot, Barbara Plank

    Abstract: The increasing number of benchmarks for Natural Language Processing (NLP) tasks in the computational job market domain highlights the demand for methods that can handle job-related tasks such as skill extraction, skill classification, job title classification, and de-identification. While some approaches have been developed that are specific to the job market domain, there is a lack of generalized…

    Submitted 20 May, 2023; originally announced May 2023.

    Comments: Accepted at ACL 2023 (Main)

  12. arXiv:2305.11016

    cs.CL

    Silver Syntax Pre-training for Cross-Domain Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has shown…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted in Findings of the Association for Computational Linguistics: ACL 2023

  13. arXiv:2305.10985

    cs.CL

    Multi-CrossRE: A Multi-Lingual Multi-Domain Dataset for Relation Extraction

    Authors: Elisa Bassignana, Filip Ginter, Sampo Pyysalo, Rob van der Goot, Barbara Plank

    Abstract: Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Accepted at NoDaLiDa 2023

  14. arXiv:2304.13989

    cs.CL

    Cross-Domain Evaluation of POS Taggers: From Wall Street Journal to Fandom Wiki

    Authors: Kia Kirstein Hansen, Rob van der Goot

    Abstract: The Wall Street Journal section of the Penn Treebank has been the de-facto standard for evaluating POS taggers for a long time, and accuracies over 97% have been reported. However, less is known about out-of-domain tagger performance, especially with fine-grained label sets. Using data from Elder Scrolls Fandom, a wiki about the Elder Scrolls video game universe, we create a modest datas…

    Submitted 27 April, 2023; originally announced April 2023.

  15. arXiv:2210.11860

    cs.CL

    Spectral Probing

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank

    Abstract: Linguistic information is encoded at varying timescales (subwords, phrases, etc.) and communicative levels, such as syntax and semantics. Contextualized embeddings have analogously been found to capture these phenomena at distinctive layers and frequencies. Leveraging these findings, we develop a fully learnable frequency filter to identify spectral profiles for any given task. It enables vastly m…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: Accepted at EMNLP 2022 (Main Conference)

  16. arXiv:2209.08071

    cs.CL

    Skill Extraction from Job Postings using Weak Supervision

    Authors: Mike Zhang, Kristian Nørgaard Jensen, Rob van der Goot, Barbara Plank

    Abstract: Aggregated data obtained from job postings provide powerful insights into labor market demands, and emerging skills, and aid job matching. However, most extraction approaches are supervised and thus need costly and time-consuming annotation. To overcome this, we propose Skill Extraction with Weak Supervision. We leverage the European Skills, Competences, Qualifications and Occupations taxonomy to…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: Accepted in RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems

  17. arXiv:2206.04935

    cs.CL

    Sort by Structure: Language Model Ranking as Dependency Probing

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank

    Abstract: Making an informed choice of pre-trained language model (LM) is critical for performance, yet environmentally costly, and as such widely underexplored. The field of Computer Vision has begun to tackle encoder ranking, with promising forays into Natural Language Processing, however they lack coverage of linguistic tasks such as structured prediction. We propose probing to rank LMs, specifically for…

    Submitted 10 June, 2022; originally announced June 2022.

    Comments: Accepted at NAACL 2022 (Main Conference)

  18. arXiv:2204.06251

    cs.LG cs.CL

    Experimental Standards for Deep Learning in Natural Language Processing Research

    Authors: Dennis Ulmer, Elisa Bassignana, Max Müller-Eberstein, Daniel Varab, Mike Zhang, Rob van der Goot, Christian Hardmeier, Barbara Plank

    Abstract: The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards…

    Submitted 17 October, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

  19. arXiv:2203.12971

    cs.CL

    Probing for Labeled Dependency Trees

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank

    Abstract: Probing has become an important tool for analyzing representations in Natural Language Processing (NLP). For graphical NLP tasks such as dependency parsing, linear probes are currently limited to extracting undirected or unlabeled parse trees which do not capture the full task. This work introduces DepProbe, a linear probe which can extract labeled and directed dependency parse trees from embeddin…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted at ACL 2022 (Main Conference)

  20. arXiv:2112.04971

    cs.CL

    How Universal is Genre in Universal Dependencies?

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank

    Abstract: This work provides the first in-depth analysis of genre in Universal Dependencies (UD). In contrast to prior work on genre identification which uses small sets of well-defined labels in mono-/bilingual setups, UD contains 18 genres with varying degrees of specificity spread across 114 languages. As most treebanks are labeled with multiple genres while lacking annotations about which instances belo…

    Submitted 9 December, 2021; originally announced December 2021.

    Comments: Accepted at SyntaxFest 2021

  21. arXiv:2112.03625

    cs.CL

    Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings

    Authors: Rob van der Goot, Miryam de Lhoneux

    Abstract: With an increase of dataset availability, the potential for learning from a variety of data sources has increased. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as distinguishing features between datasets. However, these dataset embeddings have mostly been used before c…

    Submitted 7 December, 2021; originally announced December 2021.

    Comments: Accepted to TLT at SyntaxFest 2021

  22. arXiv:2109.04733

    cs.CL

    Genre as Weak Supervision for Cross-lingual Dependency Parsing

    Authors: Max Müller-Eberstein, Rob van der Goot, Barbara Plank

    Abstract: Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dep…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021 (Main Conference)

  23. arXiv:2105.11301

    cs.CL

    DaN+: Danish Nested Named Entities and Lexical Normalization

    Authors: Barbara Plank, Kristian Nørgaard Jensen, Rob van der Goot

    Abstract: This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language an…

    Submitted 24 May, 2021; originally announced May 2021.

    Comments: COLING 2020

  24. arXiv:2105.07316

    cs.CL

    From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding

    Authors: Rob van der Goot, Ibrahim Sharaf, Aizhan Imankulova, Ahmet Üstün, Marija Stepanović, Alan Ramponi, Siti Oryza Khairunnisa, Mamoru Komachi, Barbara Plank

    Abstract: The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual Slot and Intent…

    Submitted 15 May, 2021; originally announced May 2021.

    Comments: To appear in the proceedings of NAACL 2021

  25. arXiv:2103.01273

    cs.CL

    On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions

    Authors: Rob van der Goot, Ahmet Üstün, Barbara Plank

    Abstract: Recent complementary strands of research have shown that leveraging information on the data source through encoding their properties into embeddings can lead to performance increase when training a single model on heterogeneous data sources. However, it remains unclear in which situations these dataset embeddings are most effective, because they are used in a large variety of settings, languages a…

    Submitted 5 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

  26. arXiv:2102.11152

    cs.CL

    Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data

    Authors: Anouck Braggaar, Rob van der Goot

    Abstract: This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switch utterances into Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is extra challenging because of Frisian being low-resource, the informal nature of the data, code-switching and non-…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: RESOURCEFUL-2020

  27. arXiv:2006.01175

    cs.CL

    Lexical Normalization for Code-switched Data and its Effect on POS-tagging

    Authors: Rob van der Goot, Özlem Çetinoğlu

    Abstract: Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalizatio…

    Submitted 31 January, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

  28. arXiv:2005.14672

    cs.CL

    Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

    Authors: Rob van der Goot, Ahmet Üstün, Alan Ramponi, Ibrahim Sharaf, Barbara Plank

    Abstract: Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, have advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration option…

    Submitted 11 March, 2021; v1 submitted 29 May, 2020; originally announced May 2020.

    Comments: EACL demo version (MaChAmp 0.2) https://machamp-nlp.github.io/

  29. arXiv:1905.09866

    cs.CL

    Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor

    Authors: Malvina Nissim, Rik van Noord, Rob van der Goot

    Abstract: Analogies such as "man is to king as woman is to X" are often used to illustrate the amazing power of word embeddings. Concurrently, they have also been used to expose how strongly human biases are encoded in vector spaces built on natural language, like "man is to computer programmer as woman is to homemaker". Recent work has shown that analogies are in fact not such a diagnostic for bias, and ot…

    Submitted 9 November, 2019; v1 submitted 23 May, 2019; originally announced May 2019.

  30. arXiv:1805.03122

    cs.CL

    Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

    Authors: Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank

    Abstract: Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent. Cross-lingual embeddings circumvent some of these limitations, but capture gender-specific style less. We propose an alternative: bleaching text, i.e., transforming lexical strings into more abstract features. This study pro…

    Submitted 8 May, 2018; originally announced May 2018.

    Comments: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

  31. arXiv:1710.03476

    cs.CL

    MoNoise: Modeling Noise Using a Modular Normalization System

    Authors: Rob van der Goot, Gertjan van Noord

    Abstract: We propose MoNoise: a normalization model focused on generalizability and efficiency, it aims at being easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case: from social media data to standard language. Our proposed model is based on a modular candidate generation in which each module is responsible for a…

    Submitted 10 October, 2017; originally announced October 2017.

    Comments: Source code: https://bitbucket.org/robvanderg/monoise

  32. arXiv:1707.05116

    cs.CL

    To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

    Authors: Rob van der Goot, Barbara Plank, Malvina Nissim

    Abstract: Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known on the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of…

    Submitted 17 July, 2017; originally announced July 2017.

    Comments: In WNUT 2017