Showing 1–5 of 5 results for author: Kuzman, T

  1. arXiv:2404.05428  [pdf, other]

    cs.CL

    Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining

    Authors: Nikola Ljubešić, Vít Suchomel, Peter Rupnik, Taja Kuzman, Rik van Noord

    Abstract: The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigat…

    Submitted 8 April, 2024; originally announced April 2024.

  2. arXiv:2403.12721  [pdf, other]

    cs.CL

    CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

    Authors: Nikola Ljubešić, Taja Kuzman

    Abstract: This paper presents a collection of highly comparable web corpora of Slovenian, Croatian, Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole spectrum of official languages in the South Slavic language space. The collection of these corpora comprises a total of 13 billion tokens of texts from 26 million documents. The comparability of the corpora is ensured by a co…

    Submitted 26 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: Accepted to the LREC-COLING 2024 conference

    Journal ref: https://aclanthology.org/2024.lrec-main.291

  3. arXiv:2403.08693  [pdf, other]

    cs.CL

    Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages

    Authors: Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

    Abstract: Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant larg…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: Accepted to LREC-COLING 2024 (long)

  4. arXiv:2303.03953  [pdf, other]

    cs.CL cs.LG

    ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification

    Authors: Taja Kuzman, Igor Mozetič, Nikola Ljubešić

    Abstract: ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets, manually annota…

    Submitted 8 March, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  5. arXiv:2201.03857  [pdf, other]

    cs.CL

    The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild

    Authors: Taja Kuzman, Peter Rupnik, Nikola Ljubešić

    Abstract: This paper presents a new training dataset for automatic genre identification GINCO, which is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words. Each document was manually annotated for genre with a new annotation schema that builds upon existing schemata, having primarily clarity of labels and inter-annotator agreement in mind. The dataset consists of various chall…

    Submitted 11 January, 2022; originally announced January 2022.