Showing 1–20 of 20 results for author: Ljubešić, N

  1. arXiv:2405.07363  [pdf, ps, other]

    cs.CL

    Multilingual Power and Ideology Identification in the Parliament: a Reference Dataset and Simple Baselines

    Authors: Çağrı Çöltekin, Matyáš Kopp, Katja Meden, Vaidas Morkevicius, Nikola Ljubešić, Tomaž Erjavec

    Abstract: We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the dataset, provide the reasoning behind some of the choices during its creation, present statistics on the dataset, and, using a simple classifier, some b…

    Submitted 12 May, 2024; originally announced May 2024.

  2. arXiv:2404.05428  [pdf, other]

    cs.CL

    Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

    Authors: Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

    Abstract: The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigat…

    Submitted 8 April, 2024; originally announced April 2024.

  3. arXiv:2403.12721  [pdf, other]

    cs.CL

    CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

    Authors: Nikola Ljubešić, Taja Kuzman

    Abstract: This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a co…

    Submitted 26 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to the LREC-COLING 2024 conference

    Journal ref: https://aclanthology.org/2024.lrec-main.291

  4. arXiv:2403.08693  [pdf, other]

    cs.CL

    Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

    Authors: Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

    Abstract: Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant larg…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024 (long)

  5. arXiv:2311.09122  [pdf, other]

    cs.CL

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Authors: Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter

    Abstract: We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse langu…

    Submitted 29 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 Camera-ready

  6. arXiv:2309.09783  [pdf, other]

    cs.CL

    The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings

    Authors: Michal Mochtak, Peter Rupnik, Nikola Ljubešić

    Abstract: The paper presents a new training dataset of sentences in 7 languages, manually annotated for sentiment, which are used in a series of experiments focused on training a robust sentiment identifier for parliamentary proceedings. The paper additionally introduces the first domain-specific multilingual transformer language model for political science applications, which was additionally pre-trained o…

    Submitted 20 March, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

  7. arXiv:2308.04255  [pdf, other]

    cs.CL

    CLASSLA-Stanza: The Next Step for Linguistic Processing of South Slavic Languages

    Authors: Luka Terčon, Nikola Ljubešić

    Abstract: We present CLASSLA-Stanza, a pipeline for automatic linguistic annotation of the South Slavic languages, which is based on the Stanza natural language processing pipeline. We describe the main improvements in CLASSLA-Stanza with respect to Stanza, and give a detailed description of the model training process for the latest 2.1 release of the pipeline. We also report performance scores produced by…

    Submitted 11 August, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

    Comments: 17 pages, 14 tables, 1 figure; Typos corrected

  8. arXiv:2305.20080  [pdf, other]

    cs.CL

    Findings of the VarDial Evaluation Campaign 2023

    Authors: Noëmi Aepli, Çağrı Çöltekin, Rob Van Der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

    Abstract: This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR),…

    Submitted 31 May, 2023; originally announced May 2023.

    Journal ref: In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 251-261, Dubrovnik, Croatia. Association for Computational Linguistics

  9. arXiv:2303.03953  [pdf, other]

    cs.CL cs.LG

    ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

    Authors: Taja Kuzman, Igor Mozetič, Nikola Ljubešić

    Abstract: ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets, manually annota…

    Submitted 8 March, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  10. arXiv:2206.00929  [pdf, other]

    cs.CL

    The ParlaSent-BCS dataset of sentiment-annotated parliamentary debates from Bosnia-Herzegovina, Croatia, and Serbia

    Authors: Michal Mochtak, Peter Rupnik, Nikola Ljubešić

    Abstract: Expression of sentiment in parliamentary debates is deemed to be significantly different from that on social media or in product reviews. This paper adds to an emerging body of research on parliamentary debates with a dataset of sentences annotated for the detection of sentiment polarity in political discourse. We sample the sentences for annotation from the proceedings of three Southeast European parlia…

    Submitted 2 June, 2022; originally announced June 2022.

    Comments: 8 pages, submitted to JT-DH 2022 (Language Technologies and Digital Humanities 2022) conference, number 4293

  11. arXiv:2203.08565  [pdf, other]

    cs.CL

    Geographic Adaptation of Pretrained Language Models

    Authors: Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, Hinrich Schütze

    Abstract: While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce ge…

    Submitted 28 January, 2024; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: TACL 2024 (pre-MIT Press publication version)

  12. arXiv:2201.03857  [pdf, other]

    cs.CL

    The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

    Authors: Taja Kuzman, Peter Rupnik, Nikola Ljubešić

    Abstract: This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words. Each document was manually annotated for genre with a new annotation schema that builds upon existing schemata, having primarily clarity of labels and inter-annotator agreement in mind. The dataset consists of various chall…

    Submitted 11 January, 2022; originally announced January 2022.

  13. Retweet communities reveal the main sources of hate speech

    Authors: Bojan Evkoski, Andraz Pelicon, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak

    Abstract: We address a challenging problem of identifying main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to…

    Submitted 17 March, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

    Journal ref: B. Evkoski, A. Pelicon, I. Mozetič, N. Ljubešić, P. Kralj Novak. Retweet communities reveal the main sources of hate speech, PLoS ONE 17(3): e0265602, 2022

  14. Community evolution in retweet networks

    Authors: Bojan Evkoski, Igor Mozetič, Nikola Ljubešić, Petra Kralj Novak

    Abstract: Communities in social networks often reflect close social ties between their members and their evolution through time. We propose an approach that tracks two aspects of community evolution in retweet networks: flow of the members in, out and between the communities, and their influence. We start with high resolution time windows, and then select several timepoints which exhibit large differences b…

    Submitted 2 September, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

    Journal ref: PLoS ONE 16(9): e0256175, 2021

  15. arXiv:2104.09243  [pdf, ps, other]

    cs.CL

    BERTić -- The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian

    Authors: Nikola Ljubešić, Davor Lauc

    Abstract: In this paper we describe a transformer model pre-trained on 8 billion tokens of crawled text from the Croatian, Bosnian, Serbian and Montenegrin web domains. We evaluate the transformer model on the tasks of part-of-speech tagging, named-entity-recognition, geo-location prediction and commonsense causal reasoning, showing improvements on all tasks over state-of-the-art models. For commonsense rea…

    Submitted 19 April, 2021; originally announced April 2021.

  16. arXiv:1912.05320  [pdf, other]

    cs.CL

    CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

    Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

    Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous…

    Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

  17. arXiv:1906.02053  [pdf, other]

    cs.CL

    KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning

    Authors: Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

    Abstract: This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts. Term candidates in the dataset were extracted via morphosyntactic patterns and annotated for their termness by four annotators. Experiments on the dataset show that most co-occurrence statistics, applied after morphosyntactic patterns and a frequency threshold, perform close to random…

    Submitted 5 June, 2019; originally announced June 2019.

  18. arXiv:1906.02045  [pdf, other]

    cs.CL

    The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

    Authors: Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

    Abstract: In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English developed inside the Slovene national project FRENK which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD). The main advantages of these datasets compared to the existing ones are identical sampling procedures, pr…

    Submitted 13 June, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

  19. Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

    Authors: Nikola Ljubešić, Darja Fišer, Anita Peti-Stantić

    Abstract: The notions of concreteness and imageability, traditionally important in psycholinguistics, are gaining significance in semantic-oriented natural language processing tasks. In this paper we investigate the predictability of these two concepts via supervised learning, using word embeddings as explanatory variables. We perform predictions both within and across languages by exploiting collections of…

    Submitted 8 July, 2018; originally announced July 2018.

  20. arXiv:1805.03122  [pdf, other]

    cs.CL

    Bleaching Text: Abstract Features for Cross-lingual Gender Prediction

    Authors: Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, Barbara Plank

    Abstract: Gender prediction has typically focused on lexical and social network features, yielding good performance, but making systems highly language-, topic-, and platform-dependent. Cross-lingual embeddings circumvent some of these limitations, but capture gender-specific style less. We propose an alternative: bleaching text, i.e., transforming lexical strings into more abstract features. This study pro…

    Submitted 8 May, 2018; originally announced May 2018.

    Comments: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics