Understanding the effects of language-specific class imbalance in multilingual fine-tuning

Authors: Vincent Jung, Lonneke van der Plas

Abstract: We study the effect of one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language and show that this helps mitigate those detrimental effects. These results create awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and the way in which the model learns to rely on the separation of languages to perform the task.

Submitted 20 February, 2024; originally announced February 2024.

Comments: To be published in: Findings of the Association for Computational Linguistics: EACL 2024

arXiv:2310.05597 [pdf, other]

Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance

Authors: Molly R. Petersen, Lonneke van der Plas

Abstract: While analogies are a common way to evaluate word embeddings in NLP, it is also of interest to investigate whether or not analogical reasoning is a task in itself that can be learned. In this paper, we test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans than those in commonly used NLP benchmarks. Our experiments find that models are able to learn analogical reasoning, even with a small amount of data. We additionally compare our models to a dataset with a human baseline, and find that after training, models approach human performance.

Submitted 3 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

arXiv:2205.10517 [pdf, other]

doi 10.18653/v1/2022.deeplo-1.10

Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

Authors: Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg

Abstract: Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.

Submitted 26 May, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

Comments: DeepLo 2022 camera-ready version

arXiv:2111.07793 [pdf, ps, other]

Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

Authors: Andrea DeMarco, Carlos Mena, Albert Gatt, Claudia Borg, Aiden Williams, Lonneke van der Plas

Abstract: Recent years have seen an increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or combination of them, is the most effective to improve speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.

Submitted 20 January, 2023; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 12 pages

arXiv:2109.06935 [pdf, other]

On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning

Authors: Marc Tanti, Lonneke van der Plas, Claudia Borg, Albert Gatt

Abstract: Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks -- POS tagging and natural language inference -- which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on 'unlearning' language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model's limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

Submitted 26 December, 2021; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: 14 pages, 6 figures, 5 tables, submitted in BlackBoxNLP 2021 (https://aclanthology.org/2021.blackboxnlp-1.15/)

arXiv:2008.06222 [pdf, other]

Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis

Authors: Stavros Assimakopoulos, Rebecca Vella Muskat, Lonneke van der Plas, Albert Gatt

Abstract: This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realized through the use of indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label 'hate speech,' as different annotators might have different thresholds as to what constitutes hate speech and what not. In view of this, we suggest a multi-layer annotation scheme, which is pilot-tested against a binary +/- hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used; a substantial corpus of on-line newspaper comments spanning 10 years.

Submitted 14 August, 2020; originally announced August 2020.

Comments: 10 pages, 1 table. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)

arXiv:2008.05760 [pdf, other]

MASRI-HEADSET: A Maltese Corpus for Speech Recognition

Authors: Carlos Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, Ian Padovani

Abstract: Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET Corpus is publicly available for research/academic purposes.

Submitted 13 August, 2020; originally announced August 2020.

Comments: 8 pages, 2 figures, 4 tables, 1 appendix. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20)

arXiv:2007.11973 [pdf, other]

The societal and ethical relevance of computational creativity

Authors: Michele Loi, Eleonora Viganò, Lonneke van der Plas

Abstract: In this paper, we provide a philosophical account of the value of creative systems for individuals and society. We characterize creativity in very broad philosophical terms, encompassing natural, existential, and social creative processes, such as natural evolution and entrepreneurship, and explain why creativity understood in this way is instrumental for advancing human well-being in the long term. We then explain why current mainstream AI tends to be anti-creative, which means that there are moral costs of employing this type of AI in human endeavors, although computational systems that involve creativity are on the rise. In conclusion, there is an argument for ethics to be more hospitable to creativity-enabling AI, which can also be in a trade-off with other values promoted in AI ethics, such as its explainability and accuracy.

Submitted 23 July, 2020; originally announced July 2020.

Comments: 4 pages, 1 figure, Eleventh International Conference on Computational Creativity, ICCC'20

ACM Class: I.2.8

arXiv:2006.11814 [pdf, ps, other]

A blindspot of AI ethics: anti-fragility in statistical prediction

Authors: Michele Loi, Lonneke van der Plas

Abstract: With this paper, we aim to put an issue on the agenda of AI ethics that in our view is overlooked in the current discourse. The current discussions are dominated by topics such as trustworthiness and bias, whereas the issue we would like to focus on is counter to the debate on trustworthiness. We fear that the overuse of currently dominant AI systems that are driven by short-term objectives and optimized for avoiding error leads to a society that loses its diversity and flexibility needed for true progress. We couch our concerns in the discourse around the term anti-fragility and show with some examples what threats current methods used for decision making pose for society.

Submitted 21 June, 2020; originally announced June 2020.

Comments: 7th Swiss Conference on Data Science (accepted as Poster)

MSC Class: 62P25 ACM Class: K.4.2; I.2.0; K.7.4

arXiv:1906.03634 [pdf, other]

Learning to Predict Novel Noun-Noun Compounds

Authors: Prajit Dhar, Lonneke van der Plas

Abstract: We introduce temporally and contextually-aware models for the novel task of predicting unseen but plausible concepts, as conveyed by noun-noun compounds in a time-stamped corpus. We train compositional models on observed compounds, more specifically the composed distributed representations of their constituents across a time-stamped corpus, while giving it corrupted instances (where head or modifier are replaced by a random constituent) as negative evidence. The model captures generalisations over this data and learns what combinations give rise to plausible compounds and which ones do not. After training, we query the model for the plausibility of automatically generated novel combinations and verify whether the classifications are accurate. For our best model, we find that in around 85% of the cases, the novel compounds generated are attested in previously unseen data. An additional estimated 5% are plausible despite not being attested in the recent corpus, based on judgments from independent human raters.

Submitted 25 September, 2019; v1 submitted 9 June, 2019; originally announced June 2019.

Comments: 9 pages, 3 figures, To appear at Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019) at ACL 2019. V3 - Fixed some typos and updated the Data Preprocessing section

arXiv:1906.02563 [pdf, other]

Measuring the compositionality of noun-noun compounds over time

Authors: Prajit Dhar, Janis Pagel, Lonneke van der Plas

Abstract: We present work in progress on the temporal progression of compositionality in noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of the compound as a whole is with respect to the meaning of its parts. We hypothesize that such a property might change over time. We use the time-stamped Google Books corpus for our diachronic investigations, and first examine whether the vector-based semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations. We find that using temporal information helps predict the ratings, although correlation with the ratings is lower than reported for other corpora. Finally, we show changes in compositionality over time for a selection of compounds.

Submitted 12 June, 2019; v1 submitted 6 June, 2019; originally announced June 2019.

Comments: 6 pages, 3 figures, To appear in the proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change 2019 @ ACL 2019, Fixed typos, Increased figure sizes

arXiv:1803.03827 [pdf, other]

Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions

Authors: Albert Gatt, Marc Tanti, Adrian Muscat, Patrizia Paggio, Reuben A. Farrugia, Claudia Borg, Kenneth P. Camilleri, Mike Rosner, Lonneke van der Plas

Abstract: The past few years have witnessed renewed interest in NLP tasks at the interface between vision and language. One intensively-studied problem is that of automatically generating text from images. In this paper, we extend this problem to the more specific domain of face description. Unlike scene descriptions, face descriptions are more fine-grained and rely on attributes extracted from the image, rather than objects and relations. Given that no data exists for this task, we present an ongoing crowdsourcing study to collect a corpus of descriptions of face images taken 'in the wild'. To gain a better understanding of the variation we find in face description and the possible issues that this may raise, we also conducted an annotation study on a subset of the corpus. Primarily, we found descriptions to refer to a mixture of attributes, not only physical, but also emotional and inferential, which is bound to create further challenges for current image-to-text methods.

Submitted 5 March, 2021; v1 submitted 10 March, 2018; originally announced March 2018.

Comments: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC'18)

arXiv:cs/0410062 [pdf]

Automatic Keyword Extraction from Spoken Text. A Comparison of two Lexical Resources: the EDR and WordNet

Authors: Lonneke van der Plas, Vincenzo Pallotta, Martin Rajman, Hatem Ghorbel

Abstract: Lexical resources such as WordNet and the EDR electronic dictionary have been used in several NLP tasks. Probably, partly due to the fact that the EDR is not freely available, WordNet has been used far more often than the EDR. We have used both resources on the same task in order to make a comparison possible. The task is automatic assignment of keywords to multi-party dialogue episodes (i.e. thematically coherent stretches of spoken text). We show that the use of lexical resources in such a task results in slightly higher performances than the use of a purely statistically based method.

Submitted 24 October, 2004; originally announced October 2004.

Comments: 4 pages

ACM Class: H.3.1; H.3.3; I.5.3; I.7.3

Journal ref: Proceedings of the LREC 2004 international conference, 26-28 May 2004, Lisbon, Portugal. Pages 2205-2208

Showing 1–13 of 13 results for author: van der Plas, L