Search | arXiv e-print repository

Code-mixed Sentiment and Hate-speech Prediction

Authors: Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Abstract: Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant… ▽ More Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2207.13988 [pdf, ps, other]

Sequence to sequence pretraining for a less-resourced Slovenian language

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language model but more naturally fits text generation tasks such as machine translation, summ… ▽ More Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language model but more naturally fits text generation tasks such as machine translation, summarization, question answering, text simplification, dialogue systems, etc. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two different sized T5-type sequence to sequence models for morphologically rich Slovene language with much less resources and analyzed their behavior on 11 tasks. Concerning classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model but are useful for the generative tasks. △ Less

Submitted 2 January, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

Comments: 19 pages

arXiv:2112.10553 [pdf, other]

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate the… ▽ More Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations. △ Less

Submitted 20 December, 2021; originally announced December 2021.

Comments: 12 pages. To be published in proceedings of the AIST 2021 conference

arXiv:2107.10614 [pdf, ps, other]

Evaluation of contextual embeddings on less-resourced languages

Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

Abstract: The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysi… ▽ More The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models. △ Less

Submitted 22 July, 2021; originally announced July 2021.

Comments: 45 pages

arXiv:2106.15986 [pdf, other]

doi 10.1007/s00521-022-07164-x

Cross-lingual alignments of ELMo contextual embeddings

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To pro… ▽ More Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To produce cross-lingual map**s of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context. We address this issue with a novel method for creating cross-lingual contextual alignment datasets. Based on that, we propose several cross-lingual map** methods for ELMo embeddings. The proposed linear map** methods use existing Vecmap and MUSE alignments on contextual ELMo embeddings. Novel nonlinear ELMoGAN map** methods are based on GANs and do not assume isomorphic embedding spaces. We evaluate the proposed map** methods on nine languages, using four downstream tasks: named entity recognition (NER), dependency parsing (DP), terminology alignment, and sentiment analysis. The ELMoGAN methods perform very well on the NER and terminology alignment tasks, with a lower cross-lingual loss for NER compared to the direct training on some languages. In DP and sentiment analysis, linear contextual alignment variants are more successful. △ Less

Submitted 22 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: 30 pages, 5 figures

Journal ref: Neural Computing and Applications, 2022

arXiv:2006.07890 [pdf, ps, other]

FinEst BERT and CroSloEngual BERT: less is more in multilingual models

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has been mostly focused on English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian,… ▽ More Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has been mostly focused on English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks, NER, POS-tagging, and dependency parsing, using the multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations △ Less

Submitted 14 June, 2020; originally announced June 2020.

Comments: 10 pages, accepted at TSD 2020 conference

Journal ref: Proceedings of the 23rd Internetional Conference on Text, Speech, and Dialogue (TSD 2020), pages 104-111

arXiv:1912.05320 [pdf, other]

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous… ▽ More State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far. △ Less

Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

ACM Class: I.2.7

Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

arXiv:1911.10049 [pdf, other]

High Quality ELMo Embeddings for Seven Less-Resourced Languages

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification task. We offer precomputed embeddings from popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the… ▽ More Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification task. We offer precomputed embeddings from popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the size of training set and show that existing publicly available ELMo embeddings for listed languages shall be improved. We train new ELMo embeddings on much larger training sets and show their advantage over baseline non-contextual FastText embeddings. In evaluation, we use two benchmarks, the analogy task and the NER task. △ Less

Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 8 pages, 3 figures, LREC2020 conference

Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4731-4738

arXiv:1911.10038 [pdf, ps, other]

Multilingual Culture-Independent Word Analogy Datasets

Authors: Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English,… ▽ More In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings. △ Less

Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 7 pages, LREC2020 conference

ACM Class: J.5

Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080

Showing 1–9 of 9 results for author: Ulcar, M