Showing 1–9 of 9 results for author: Ulcar, M

  1. arXiv:2405.12929  [pdf, other]

    cs.CL

    Code-mixed Sentiment and Hate-speech Prediction

    Authors: Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

    Abstract: Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant…

    Submitted 21 May, 2024; originally announced May 2024.

  2. arXiv:2207.13988  [pdf, ps, other]

    cs.CL

    Sequence to sequence pretraining for a less-resourced Slovenian language

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language model but more naturally fits text generation tasks such as machine translation, summ…

    Submitted 2 January, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: 19 pages

  3. arXiv:2112.10553  [pdf, other]

    cs.CL

    Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate the…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 12 pages. To be published in proceedings of the AIST 2021 conference

  4. arXiv:2107.10614  [pdf, ps, other]

    cs.CL

    Evaluation of contextual embeddings on less-resourced languages

    Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

    Abstract: The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysi…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: 45 pages

  5. Cross-lingual alignments of ELMo contextual embeddings

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To pro…

    Submitted 22 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: 30 pages, 5 figures

    Journal ref: Neural Computing and Applications, 2022

  6. arXiv:2006.07890  [pdf, ps, other]

    cs.CL

    FinEst BERT and CroSloEngual BERT: less is more in multilingual models

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has been mostly focused on English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian,…

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: 10 pages, accepted at TSD 2020 conference

    Journal ref: Proceedings of the 23rd International Conference on Text, Speech, and Dialogue (TSD 2020), pages 104-111

  7. arXiv:1912.05320  [pdf, other]

    cs.CL

    CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

    Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

    Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous…

    Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

  8. arXiv:1911.10049  [pdf, other]

    cs.CL cs.LG

    High Quality ELMo Embeddings for Seven Less-Resourced Languages

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification tasks. We offer precomputed embeddings from the popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the…

    Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

    Comments: 8 pages, 3 figures, LREC2020 conference

    Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4731-4738

  9. arXiv:1911.10038  [pdf, ps, other]

    cs.CL

    Multilingual Culture-Independent Word Analogy Datasets

    Authors: Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

    Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English,…

    Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

    Comments: 7 pages, LREC2020 conference

    ACM Class: J.5

    Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080