Showing 1–33 of 33 results for author: Robnik-Šikonja, M

  1. arXiv:2309.06089  [pdf, ps, other]

    cs.CL cs.LG

    Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies

    Authors: Boshko Koloski, Blaž Škrlj, Marko Robnik-Šikonja, Senja Pollak

    Abstract: The cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual tr…

    Submitted 15 April, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

  2. arXiv:2306.11518  [pdf, ps, other]

    cs.CL

    One model to rule them all: ranking Slovene summarizers

    Authors: Aleš Žagar, Marko Robnik-Šikonja

    Abstract: Text summarization is an essential task in natural language processing, and researchers have developed various approaches over the years, ranging from rule-based systems to neural networks. However, there is no single model or approach that performs well on every type of text. We propose a system that recommends the most suitable summarization model for a given text. The proposed system employs a…

    Submitted 7 August, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

  3. Detection of depression on social networks using transformers and ensembles

    Authors: Ilija Tavchioski, Marko Robnik-Šikonja, Senja Pollak

    Abstract: As the impact of technology on our lives is increasing, we witness increased use of social media that became an essential tool not only for communication but also for sharing information with community about our thoughts and feelings. This can be observed also for people with mental health disorders such as depression where they use social media for expressing their thoughts and asking for help. T…

    Submitted 9 May, 2023; originally announced May 2023.

  4. Feature construction using explanations of individual predictions

    Authors: Boštjan Vouk, Matej Guid, Marko Robnik-Šikonja

    Abstract: Feature construction can contribute to comprehensibility and performance of machine learning models. Unfortunately, it usually requires exhaustive search in the attribute space or time-consuming human involvement to generate meaningful features. We propose a novel heuristic approach for reducing the search space based on aggregation of instance-based explanations of predictive models. The proposed…

    Submitted 23 January, 2023; originally announced January 2023.

    Comments: 54 pages, 10 figures, 22 tables

    MSC Class: 68-04; 68T30; 97N80 ACM Class: I.2.6; I.5.2; I.6.5; G.3; G.4

    Journal ref: Engineering Applications of Artificial Intelligence 120 (2023) 105823

  5. arXiv:2211.09159  [pdf, ps, other]

    cs.CL cs.AI

    Unified Question Answering in Slovene

    Authors: Katja Logar, Marko Robnik-Šikonja

    Abstract: Question answering is one of the most challenging tasks in language understanding. Most approaches are developed for English, while less-resourced languages are much less researched. We adapt a successful English question-answering approach, called UnifiedQA, to the less-resourced Slovene language. Our adaptation uses the encoder-decoder transformer SloT5 and mT5 models to handle four question-ans…

    Submitted 16 November, 2022; originally announced November 2022.

    Comments: 4 pages, published in Proceedings of the 25th International Multiconference INFORMATION SOCIETY - IS 2012, Volume A - Slovenian Conference on Artificial Intelligence SCAI 2022, Ljubljana, 2022, pp. 23-26

    MSC Class: 68T50 ACM Class: I.2.7

  6. arXiv:2208.10228  [pdf, other]

    cs.CL cs.LG q-bio.BM

    Review of Natural Language Processing in Pharmacology

    Authors: Dimitar Trajanov, Vangel Trajkovski, Makedonka Dimitrieva, Jovana Dobreva, Milos Jovanovik, Matej Klemen, Aleš Žagar, Marko Robnik-Šikonja

    Abstract: Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of th…

    Submitted 26 January, 2023; v1 submitted 22 August, 2022; originally announced August 2022.

    Comments: 42 pages, 2 figures, 7 tables

    ACM Class: J.3; A.1

  7. arXiv:2207.13988  [pdf, ps, other]

    cs.CL

    Sequence to sequence pretraining for a less-resourced Slovenian language

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language model but more naturally fits text generation tasks such as machine translation, summ…

    Submitted 2 January, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: 19 pages

  8. arXiv:2207.01054  [pdf, other]

    cs.CL

    Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis

    Authors: Kristian Miok, Encarnacion Hidalgo-Tenorio, Petya Osenova, Miguel-Angel Benitez-Castro, Marko Robnik-Sikonja

    Abstract: Parliamentary and legislative debate transcripts provide informative insight into elected politicians' opinions, positions, and policy preferences. They are interesting for political and social sciences as well as linguistics and natural language processing (NLP) research. While existing research studied individual parliaments, we apply advanced NLP methods to a joint and comparative analysis of s…

    Submitted 20 June, 2023; v1 submitted 3 July, 2022; originally announced July 2022.

  9. arXiv:2202.04994  [pdf, ps, other]

    cs.CL

    Slovene SuperGLUE Benchmark: Translation and Evaluation

    Authors: Aleš Žagar, Marko Robnik-Šikonja

    Abstract: We present a Slovene combined machine-human translated SuperGLUE benchmark. We describe the translation process and problems arising due to differences in morphology and grammar. We evaluate the translated datasets in several modes: monolingual, cross-lingual, and multilingual, taking into account differences between machine and human translated training sets. The results show that the monolingual…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: arXiv admin note: text overlap with arXiv:2107.10614

  10. arXiv:2112.10553  [pdf, other]

    cs.CL

    Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate the…

    Submitted 20 December, 2021; originally announced December 2021.

    Comments: 12 pages. To be published in proceedings of the AIST 2021 conference

  11. arXiv:2111.07119  [pdf, other]

    cs.CL

    Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

    Authors: Matej Klemen, Marko Robnik-Šikonja

    Abstract: Paraphrasing is a useful natural language processing task that can contribute to more diverse generated or translated texts. Natural language inference (NLI) and paraphrasing share some similarities and can benefit from a joint approach. We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets. Our approach is based on…

    Submitted 13 November, 2021; originally announced November 2021.

  12. Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles

    Authors: Boshko Koloski, Timen Stepišnik-Perdih, Marko Robnik-Šikonja, Senja Pollak, Blaž Škrlj

    Abstract: Increasing amounts of freely available data both in textual and relational form offers exploration of richer document representations, potentially improving the model performance and robustness. An emerging problem in the modern era is fake news detection -- many easily available pieces of information are not necessarily factually correct, and can lead to wrong conclusions or are used for manipula…

    Submitted 15 February, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

  13. arXiv:2107.10614  [pdf, ps, other]

    cs.CL

    Evaluation of contextual embeddings on less-resourced languages

    Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

    Abstract: The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysi…

    Submitted 22 July, 2021; originally announced July 2021.

    Comments: 45 pages

  14. Cross-lingual alignments of ELMo contextual embeddings

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To pro…

    Submitted 22 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

    Comments: 30 pages, 5 figures

    Journal ref: Neural Computing and Applications, 2022

  15. arXiv:2012.04307  [pdf]

    cs.CL cs.LG

    Cross-lingual Transfer of Abstractive Summarizer to Less-resource Language

    Authors: Aleš Žagar, Marko Robnik-Šikonja

    Abstract: Automatic text summarization extracts important information from texts and presents the information in the form of a summary. Abstractive summarization approaches progressed significantly by switching to deep neural networks, but results are not yet satisfactory, especially for languages where large training sets do not exist. In several natural language processing tasks, a cross-lingual model tra…

    Submitted 2 September, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

  16. Enhancing deep neural networks with morphological information

    Authors: Matej Klemen, Luka Krsnik, Marko Robnik-Šikonja

    Abstract: Deep learning approaches are superior in NLP due to their ability to extract informative features and patterns from languages. The two most successful neural architectures are LSTM and transformers, used in large pretrained language models such as BERT. While cross-lingual approaches are on the rise, most current NLP techniques are designed and applied to English, and less-resourced languages are…

    Submitted 1 March, 2022; v1 submitted 24 November, 2020; originally announced November 2020.

    Comments: Updated version, accepted to Natural Language Engineering

  17. arXiv:2010.14872  [pdf, other]

    cs.CL stat.ML

    Bayesian Methods for Semi-supervised Text Annotation

    Authors: Kristian Miok, Gregor Pirs, Marko Robnik-Sikonja

    Abstract: Human annotations are an important source of information in the development of natural language understanding approaches. As under the pressure of productivity annotators can assign different labels to a given text, the quality of produced annotations frequently varies. This is especially the case if decisions are difficult, with high cognitive load, requires awareness of broader context, or caref…

    Submitted 28 October, 2020; originally announced October 2020.

    Comments: Accepted for COLING 2020, The 14th Linguistic Annotation Workshop

  18. MICE: Mining Idioms with Contextual Embeddings

    Authors: Tadej Škvorc, Polona Gantar, Marko Robnik-Šikonja

    Abstract: Idiomatic expressions can be problematic for natural language processing applications as their meaning cannot be inferred from their constituting words. A lack of successful methodological approaches and sufficiently large datasets prevents the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach,…

    Submitted 10 November, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

  19. arXiv:2006.07890  [pdf, ps, other]

    cs.CL

    FinEst BERT and CroSloEngual BERT: less is more in multilingual models

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has been mostly focused on English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian,…

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: 10 pages, accepted at TSD 2020 conference

    Journal ref: Proceedings of the 23rd International Conference on Text, Speech, and Dialogue (TSD 2020), pages 104-111

  20. Propositionalization and Embeddings: Two Sides of the Same Coin

    Authors: Nada Lavrač, Blaž Škrlj, Marko Robnik-Šikonja

    Abstract: Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a s…

    Submitted 8 June, 2020; originally announced June 2020.

    Comments: Accepted in MLJ

  21. arXiv:2005.07456  [pdf]

    cs.CL cs.LG

    Cross-lingual Transfer of Sentiment Classifiers

    Authors: Marko Robnik-Sikonja, Kristjan Reba, Igor Mozetic

    Abstract: Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by constructing a mapping between vector spaces of two languages or learning a joint vector space for multiple languag…

    Submitted 24 March, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: 18 pages, 8 tables

    MSC Class: 68T50 (Primary) ACM Class: I.2.7; J.4; K.4.2

  22. arXiv:2005.06173  [pdf]

    cs.LG stat.ML

    Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders

    Authors: Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie

    Abstract: Due to complex experimental settings, missing values are common in biomedical data. To handle this issue, many methods have been proposed, from ignoring incomplete instances to various data imputation approaches. With the recent rise of deep neural networks, the field of missing data imputation has oriented towards modelling of the data distribution. This paper presents an approach based on Monte…

    Submitted 13 May, 2020; originally announced May 2020.

  23. arXiv:2005.05716  [pdf, other]

    cs.LG stat.ML

    AttViz: Online exploration of self-attention for transparent neural language modeling

    Authors: Blaž Škrlj, Nika Eržen, Shane Sheehan, Saturnino Luz, Marko Robnik-Šikonja, Senja Pollak

    Abstract: Neural language models are becoming the prevailing methodology for the tasks of query answering, text classification, disambiguation, completion and translation. Commonly comprised of hundreds of millions of parameters, these neural network models offer state-of-the-art performance at the cost of interpretability; humans are no longer capable of tracing and understanding how decisions are being ma…

    Submitted 12 May, 2020; originally announced May 2020.

  24. arXiv:1912.05320  [pdf, other]

    cs.CL

    CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

    Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

    Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous…

    Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

  25. arXiv:1911.10049  [pdf, other]

    cs.CL cs.LG

    High Quality ELMo Embeddings for Seven Less-Resourced Languages

    Authors: Matej Ulčar, Marko Robnik-Šikonja

    Abstract: Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification tasks. We offer precomputed embeddings from the popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the…

    Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

    Comments: 8 pages, 3 figures, LREC2020 conference

    Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4731-4738

  26. arXiv:1911.10038  [pdf, ps, other]

    cs.CL

    Multilingual Culture-Independent Word Analogy Datasets

    Authors: Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

    Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English,…

    Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

    Comments: 7 pages, LREC2020 conference

    ACM Class: J.5

    Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080

  27. Prediction Uncertainty Estimation for Hate Speech Classification

    Authors: Kristian Miok, Dong Nguyen-Doan, Blaž Škrlj, Daniela Zaharie, Marko Robnik-Šikonja

    Abstract: As a result of social network popularity, in recent years, hate speech phenomenon has significantly increased. Due to its harmful effect on minority groups as well as on large communities, there is a pressing need for hate speech detection and filtering. However, automatic approaches shall not jeopardize free speech, so they shall accompany their decisions with explanations and assessment of uncer…

    Submitted 12 December, 2019; v1 submitted 16 September, 2019; originally announced September 2019.

    Comments: The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-31372-2_24

    Journal ref: Statistical Language and Speech Processing 2019 Proceedings

  28. arXiv:1909.05755  [pdf, other]

    stat.ML cs.LG

    Generating Data using Monte Carlo Dropout

    Authors: Kristian Miok, Dong Nguyen-Doan, Daniela Zaharie, Marko Robnik-Šikonja

    Abstract: For many analytical problems the challenge is to handle huge amounts of available data. However, there are data science application areas where collecting information is difficult and costly, e.g., in the study of geological phenomena, rare diseases, faults in complex systems, insurance frauds, etc. In many such cases, generators of synthetic data with the same statistical and predictive propertie…

    Submitted 16 September, 2019; v1 submitted 12 September, 2019; originally announced September 2019.

  29. Exploring the relations between net benefits of IT projects and CIOs' perception of quality of software development disciplines

    Authors: Damjan Vavpotič, Marko Robnik-Šikonja, Tomaž Hovelja

    Abstract: Software development enterprises are under consistent pressure to improve their management techniques and development processes. These are comprised of several disciplines like requirements acquisition, design, coding, testing, etc. that must be continuously improved and individually tailored to suit specific software development project. This paper presents an evaluation approach that enables the…

    Submitted 12 August, 2019; originally announced August 2019.

    MSC Class: 68N99 ACM Class: D.2.9

    Journal ref: Business & Information Systems Engineering, 2019

  30. Supervised and Unsupervised Neural Approaches to Text Readability

    Authors: Matej Martinc, Senja Pollak, Marko Robnik-Šikonja

    Abstract: We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages and allows adaptation…

    Submitted 11 March, 2021; v1 submitted 26 July, 2019; originally announced July 2019.

    Comments: 39 pages, published in Computational Linguistic Journal

  31. arXiv:1902.03964  [pdf, other]

    cs.LG stat.ML

    Deep Node Ranking for Neuro-symbolic Structural Node Embedding and Classification

    Authors: Blaž Škrlj, Jan Kralj, Janez Konc, Marko Robnik-Šikonja, Nada Lavrač

    Abstract: Network node embedding is an active research subfield of complex network analysis. This paper contributes a novel approach to learning network node embeddings and direct node classification using a node ranking scheme coupled with an autoencoder-based neural network architecture. The main advantages of the proposed Deep Node Ranking (DNR) algorithm are competitive or better classification performa…

    Submitted 30 August, 2021; v1 submitted 11 February, 2019; originally announced February 2019.

    Comments: Accepted for publication in IJIS

  32. arXiv:1406.4287  [pdf]

    cs.AI stat.AP

    Identifying roles of clinical pharmacy with survey evaluation

    Authors: Andreja Čufar, Aleš Mrhar, Marko Robnik-Šikonja

    Abstract: The survey data sets are important sources of data and their successful exploitation is of key importance for informed policy-decision making. We present how a survey analysis approach initially developed for customer satisfaction research in marketing can be adapted for the introduction of clinical pharmacy services into hospital. We use two analytical approaches to extract relevant managerial co…

    Submitted 17 June, 2014; originally announced June 2014.

    MSC Class: 68T37 ACM Class: I.2.1; I.2.6

  33. arXiv:1403.7308  [pdf, ps, other]

    stat.ML cs.AI cs.LG

    Data Generators for Learning Systems Based on RBF Networks

    Authors: Marko Robnik-Šikonja

    Abstract: There are plenty of problems where the data available is scarce and expensive. We propose a generator of semi-artificial data with similar properties to the original data which enables development and testing of different data mining algorithms and optimization of their parameters. The generated data allow a large scale experimentation and simulations without danger of overfitting. The proposed ge…

    Submitted 19 July, 2020; v1 submitted 28 March, 2014; originally announced March 2014.

    MSC Class: 62-07; 62H30; 97N80; 65C10 ACM Class: I.2.6; I.5.2; I.6.5; G.3; G.4

    Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 27(5):926-938, 2016