Showing 1–13 of 13 results for author: van der Plas, L

  1. arXiv:2402.13016  [pdf, other]

    cs.CL

    Understanding the effects of language-specific class imbalance in multilingual fine-tuning

    Authors: Vincent Jung, Lonneke van der Plas

    Abstract: We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: To be published in: Findings of the Association for Computational Linguistics: EACL 2024

  2. arXiv:2310.05597  [pdf, other]

    cs.CL

    Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance

    Authors: Molly R. Petersen, Lonneke van der Plas

    Abstract: While analogies are a common way to evaluate word embeddings in NLP, it is also of interest to investigate whether or not analogical reasoning is a task in itself that can be learned. In this paper, we test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans than those in commonly used…

    Submitted 3 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

  3. Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

    Authors: Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg

    Abstract: Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set ups. We conduct evaluations with the newly pre-train…

    Submitted 26 May, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

    Comments: DeepLo 2022 camera-ready version

  4. arXiv:2111.07793  [pdf, ps, other]

    cs.CL

    Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

    Authors: Andrea DeMarco, Carlos Mena, Albert Gatt, Claudia Borg, Aiden Williams, Lonneke van der Plas

    Abstract: Recent years have seen an increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthe…

    Submitted 20 January, 2023; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: 12 pages

  5. arXiv:2109.06935  [pdf, other]

    cs.CL cs.NE

    On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning

    Authors: Marc Tanti, Lonneke van der Plas, Claudia Borg, Albert Gatt

    Abstract: Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks -- POS tagging and natural language inference -- which require the model to bring to bear different degrees of language-specific knowledge. Visualisat…

    Submitted 26 December, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

    Comments: 14 pages, 6 figures, 5 tables, submitted in BlackBoxNLP 2021 (https://aclanthology.org/2021.blackboxnlp-1.15/)

  6. arXiv:2008.06222  [pdf, other]

    cs.CY cs.CL

    Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis

    Authors: Stavros Assimakopoulos, Rebecca Vella Muskat, Lonneke van der Plas, Albert Gatt

    Abstract: This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that hate speech is not…

    Submitted 14 August, 2020; originally announced August 2020.

    Comments: 10 pages, 1 table. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)

  7. arXiv:2008.05760  [pdf, other]

    cs.CL cs.LG

    MASRI-HEADSET: A Maltese Corpus for Speech Recognition

    Authors: Carlos Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, Ian Padovani

    Abstract: Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech pai…

    Submitted 13 August, 2020; originally announced August 2020.

    Comments: 8 pages, 2 figures, 4 tables, 1 appendix. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)

  8. arXiv:2007.11973  [pdf, other]

    cs.AI econ.GN

    The societal and ethical relevance of computational creativity

    Authors: Michele Loi, Eleonora Viganò, Lonneke van der Plas

    Abstract: In this paper, we provide a philosophical account of the value of creative systems for individuals and society. We characterize creativity in very broad philosophical terms, encompassing natural, existential, and social creative processes, such as natural evolution and entrepreneurship, and explain why creativity understood in this way is instrumental for advancing human well-being in the long ter…

    Submitted 23 July, 2020; originally announced July 2020.

    Comments: 4 pages, 1 figure, Eleventh International Conference on Computational Creativity, ICCC'20

    ACM Class: I.2.8

  9. arXiv:2006.11814  [pdf, ps, other]

    cs.AI

    A blindspot of AI ethics: anti-fragility in statistical prediction

    Authors: Michele Loi, Lonneke van der Plas

    Abstract: With this paper, we aim to put an issue on the agenda of AI ethics that in our view is overlooked in the current discourse. The current discussions are dominated by topics such as trustworthiness and bias, whereas the issue we like to focus on is counter to the debate on trustworthiness. We fear that the overuse of currently dominant AI systems that are driven by short-term objectives and optimized…

    Submitted 21 June, 2020; originally announced June 2020.

    Comments: 7th Swiss Conference on Data Science (accepted as Poster)

    MSC Class: 62P25 ACM Class: K.4.2; I.2.0; K.7.4

  10. arXiv:1906.03634  [pdf, other]

    cs.CL

    Learning to Predict Novel Noun-Noun Compounds

    Authors: Prajit Dhar, Lonneke van der Plas

    Abstract: We introduce temporally and contextually-aware models for the novel task of predicting unseen but plausible concepts, as conveyed by noun-noun compounds in a time-stamped corpus. We train compositional models on observed compounds, more specifically the composed distributed representations of their constituents across a time-stamped corpus, while giving it corrupted instances (where head or modifi…

    Submitted 25 September, 2019; v1 submitted 9 June, 2019; originally announced June 2019.

    Comments: 9 pages, 3 figures, To appear at Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019. V3 - Fixed some typos and updated the Data Preprocessing section

  11. arXiv:1906.02563  [pdf, other]

    cs.CL

    Measuring the compositionality of noun-noun compounds over time

    Authors: Prajit Dhar, Janis Pagel, Lonneke van der Plas

    Abstract: We present work in progress on the temporal progression of compositionality in noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of the compound as a whole is with respect to the meaning of its parts. We hypothesize that such a property might change over t…

    Submitted 12 June, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: 6 pages, 3 figures, To appear in the proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change 2019 @ ACL 2019, Fixed typos, Increased figure sizes

  12. arXiv:1803.03827  [pdf, other]

    cs.CL cs.AI cs.CV

    Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions

    Authors: Albert Gatt, Marc Tanti, Adrian Muscat, Patrizia Paggio, Reuben A. Farrugia, Claudia Borg, Kenneth P. Camilleri, Mike Rosner, Lonneke van der Plas

    Abstract: The past few years have witnessed renewed interest in NLP tasks at the interface between vision and language. One intensively-studied problem is that of automatically generating text from images. In this paper, we extend this problem to the more specific domain of face description. Unlike scene descriptions, face descriptions are more fine-grained and rely on attributes extracted from the image, r…

    Submitted 5 March, 2021; v1 submitted 10 March, 2018; originally announced March 2018.

    Comments: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC'18)

  13. arXiv:cs/0410062  [pdf]

    cs.CL cs.DL cs.IR

    Automatic Keyword Extraction from Spoken Text. A Comparison of two Lexical Resources: the EDR and WordNet

    Authors: Lonneke van der Plas, Vincenzo Pallotta, Martin Rajman, Hatem Ghorbel

    Abstract: Lexical resources such as WordNet and the EDR electronic dictionary have been used in several NLP tasks. Probably, partly due to the fact that the EDR is not freely available, WordNet has been used far more often than the EDR. We have used both resources on the same task in order to make a comparison possible. The task is automatic assignment of keywords to multi-party dialogue episodes (i.e. th…

    Submitted 24 October, 2004; originally announced October 2004.

    Comments: 4 pages

    ACM Class: H.3.1; H.3.3; I.5.3; I.7.3

    Journal ref: Proceedings of the LREC 2004 international conference, 26-28 May 2004, Lisbon, Portugal. Pages 2205-2208