Search | arXiv e-print repository

Measuring Catastrophic Forgetting in Cross-Lingual Transfer Paradigms: Exploring Tuning Strategies

Authors: Boshko Koloski, Blaž Škrlj, Marko Robnik-Šikonja, Senja Pollak

Abstract: Cross-lingual transfer is a promising technique to solve tasks in less-resourced languages. In this empirical study, we compare two fine-tuning approaches combined with zero-shot and full-shot learning approaches for large language models in a cross-lingual setting. As fine-tuning strategies, we compare parameter-efficient adapter methods with fine-tuning of all parameters. As cross-lingual transfer strategies, we compare intermediate training (IT), which uses each language sequentially, and cross-lingual validation (CLV), which uses a target language already in the validation phase of fine-tuning. We assess the success of transfer and the extent of catastrophic forgetting in a source language due to cross-lingual transfer, i.e., how much previously acquired knowledge is lost when we learn new information in a different language. The results on two different classification problems, hate speech detection and product reviews, each containing datasets in several languages, show that the IT cross-lingual strategy outperforms CLV for the target language. Our findings indicate that, in the majority of cases, the CLV strategy demonstrates superior retention of knowledge in the base language (English) compared to the IT strategy, when evaluating catastrophic forgetting in multiple cross-lingual transfers.

Submitted 15 April, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

arXiv:2306.11518 [pdf, ps, other]

One model to rule them all: ranking Slovene summarizers

Authors: Aleš Žagar, Marko Robnik-Šikonja

Abstract: Text summarization is an essential task in natural language processing, and researchers have developed various approaches over the years, ranging from rule-based systems to neural networks. However, there is no single model or approach that performs well on every type of text. We propose a system that recommends the most suitable summarization model for a given text. The proposed system employs a fully connected neural network that analyzes the input content and predicts which summarizer should score the best in terms of ROUGE score for a given input. The meta-model selects among four different summarization models, developed for the Slovene language, using different properties of the input, in particular its Doc2Vec document representation. The four Slovene summarization models deal with different challenges associated with text summarization in a less-resourced language. We evaluate the proposed SloMetaSum model performance automatically and parts of it manually. The results show that the system successfully automates the step of manually selecting the best model.

Submitted 7 August, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2305.05325 [pdf, ps, other]

doi 10.14746/amup.9788323241775

Detection of depression on social networks using transformers and ensembles

Authors: Ilija Tavchioski, Marko Robnik-Šikonja, Senja Pollak

Abstract: As the impact of technology on our lives increases, we witness increased use of social media, which has become an essential tool not only for communication but also for sharing our thoughts and feelings with the community. This can also be observed for people with mental health disorders such as depression, who use social media to express their thoughts and ask for help. This opens a possibility to automatically process social media posts and detect signs of depression. We build several large pre-trained language model based classifiers for depression detection from social media posts. Besides fine-tuning BERT, RoBERTa, BERTweet, and mentalBERT, we also construct two types of ensembles. We analyze the performance of our models on two data sets of posts from the social platforms Reddit and Twitter, and also investigate the performance of transfer learning across the two data sets. The results show that transformer ensembles improve over the single transformer-based classifiers.

Submitted 9 May, 2023; originally announced May 2023.

arXiv:2301.09631 [pdf, other]

doi 10.1016/j.engappai.2023.105823

Feature construction using explanations of individual predictions

Authors: Boštjan Vouk, Matej Guid, Marko Robnik-Šikonja

Abstract: Feature construction can contribute to comprehensibility and performance of machine learning models. Unfortunately, it usually requires exhaustive search in the attribute space or time-consuming human involvement to generate meaningful features. We propose a novel heuristic approach for reducing the search space based on aggregation of instance-based explanations of predictive models. The proposed Explainable Feature Construction (EFC) methodology identifies groups of co-occurring attributes exposed by popular explanation methods, such as IME and SHAP. We empirically show that reducing the search to these groups significantly reduces the time of feature construction using logical, relational, Cartesian, numerical, and threshold num-of-N and X-of-N constructive operators. An analysis on 10 transparent synthetic datasets shows that EFC effectively identifies informative groups of attributes and constructs relevant features. Using 30 real-world classification datasets, we show significant improvements in classification accuracy for several classifiers and demonstrate the feasibility of the proposed feature construction even for large datasets. Finally, EFC generated interpretable features on a real-world problem from the financial industry, which were confirmed by a domain expert.

Submitted 23 January, 2023; originally announced January 2023.

Comments: 54 pages, 10 figures, 22 tables

MSC Class: 68-04; 68T30; 97N80 ACM Class: I.2.6; I.5.2; I.6.5; G.3; G.4

Journal ref: Engineering Applications of Artificial Intelligence 120 (2023) 105823

arXiv:2211.09159 [pdf, ps, other]

Unified Question Answering in Slovene

Authors: Katja Logar, Marko Robnik-Šikonja

Abstract: Question answering is one of the most challenging tasks in language understanding. Most approaches are developed for English, while less-resourced languages are much less researched. We adapt a successful English question-answering approach, called UnifiedQA, to the less-resourced Slovene language. Our adaptation uses the encoder-decoder transformer SloT5 and mT5 models to handle four question-answering formats: yes/no, multiple-choice, abstractive, and extractive. We use existing Slovene adaptations of four datasets, and machine translate the MCTest dataset. We show that a general model can answer questions in different formats at least as well as specialized models. The results are further improved using cross-lingual transfer from English. While we produce state-of-the-art results for Slovene, the performance still lags behind English.

Submitted 16 November, 2022; originally announced November 2022.

Comments: 4 pages, published in Proceedings of the 25th International Multiconference INFORMATION SOCIETY - IS 2022, Volume A - Slovenian Conference on Artificial Intelligence SCAI 2022, Ljubljana, 2022, pp. 23-26

MSC Class: 68T50 ACM Class: I.2.7

arXiv:2208.10228 [pdf, other]

Review of Natural Language Processing in Pharmacology

Authors: Dimitar Trajanov, Vangel Trajkovski, Makedonka Dimitrieva, Jovana Dobreva, Milos Jovanovik, Matej Klemen, Aleš Žagar, Marko Robnik-Šikonja

Abstract: Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.

Submitted 26 January, 2023; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: 42 pages, 2 figures, 7 tables

ACM Class: J.3; A.1

arXiv:2207.13988 [pdf, ps, other]

Sequence to sequence pretraining for a less-resourced Slovenian language

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained language models have recently conquered the area of natural language processing. As an alternative to the predominant masked language modelling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which includes masked language modelling but more naturally fits text generation tasks such as machine translation, summarization, question answering, text simplification, dialogue systems, etc. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages. In contrast, we trained two different sized T5-type sequence to sequence models for the morphologically rich Slovene language with far fewer resources and analyzed their behavior on 11 tasks. Concerning classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model but are useful for the generative tasks.

Submitted 2 January, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

Comments: 19 pages

arXiv:2207.01054 [pdf, other]

Multi-aspect Multilingual and Cross-lingual Parliamentary Speech Analysis

Authors: Kristian Miok, Encarnacion Hidalgo-Tenorio, Petya Osenova, Miguel-Angel Benitez-Castro, Marko Robnik-Sikonja

Abstract: Parliamentary and legislative debate transcripts provide informative insight into elected politicians' opinions, positions, and policy preferences. They are interesting for political and social sciences as well as linguistics and natural language processing (NLP) research. While existing research studied individual parliaments, we apply advanced NLP methods to a joint and comparative analysis of six national parliaments (Bulgarian, Czech, French, Slovene, Spanish, and United Kingdom) between 2017 and 2020. We analyze emotions and sentiment in the transcripts from the ParlaMint dataset collection and assess if the age, gender, and political orientation of speakers can be detected from their speeches. The results show some commonalities and many surprising differences among the analyzed countries.

Submitted 20 June, 2023; v1 submitted 3 July, 2022; originally announced July 2022.

arXiv:2202.04994 [pdf, ps, other]

Slovene SuperGLUE Benchmark: Translation and Evaluation

Authors: Aleš Žagar, Marko Robnik-Šikonja

Abstract: We present a Slovene combined machine-human translated SuperGLUE benchmark. We describe the translation process and problems arising due to differences in morphology and grammar. We evaluate the translated datasets in several modes: monolingual, cross-lingual, and multilingual, taking into account differences between machine and human translated training sets. The results show that the monolingual Slovene SloBERTa model is superior to massively multilingual and trilingual BERT models, but these also show a good cross-lingual performance on certain tasks. The performance of Slovene models still lags behind the best English models.

Submitted 10 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:2107.10614

arXiv:2112.10553 [pdf, other]

Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations.

Submitted 20 December, 2021; originally announced December 2021.

Comments: 12 pages. To be published in proceedings of the AIST 2021 conference

arXiv:2111.07119 [pdf, other]

Extracting and filtering paraphrases by bridging natural language inference and paraphrasing

Authors: Matej Klemen, Marko Robnik-Šikonja

Abstract: Paraphrasing is a useful natural language processing task that can contribute to more diverse generated or translated texts. Natural language inference (NLI) and paraphrasing share some similarities and can benefit from a joint approach. We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets. Our approach is based on bidirectional entailment; namely, if two sentences can be mutually entailed, they are paraphrases. We evaluate our approach using several large pretrained transformer language models in the monolingual and cross-lingual setting. The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2110.10457 [pdf, other]

doi 10.1016/j.neucom.2022.01.096

Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles

Authors: Boshko Koloski, Timen Stepišnik-Perdih, Marko Robnik-Šikonja, Senja Pollak, Blaž Škrlj

Abstract: Increasing amounts of freely available data, both in textual and relational form, enable the exploration of richer document representations, potentially improving model performance and robustness. An emerging problem in the modern era is fake news detection -- many easily available pieces of information are not necessarily factually correct, and can lead to wrong conclusions or are used for manipulation. In this work we explore how different document representations, ranging from simple symbolic bag-of-words to contextual, neural language model-based ones, can be used for efficient fake news identification. One of the key contributions is a set of novel document representation learning methods based solely on knowledge graphs, i.e. extensive collections of (grounded) subject-predicate-object triplets. We demonstrate that knowledge graph-based representations already achieve performance competitive with conventionally accepted representation learners. Furthermore, when combined with existing, contextual representations, knowledge graph-based document representations can achieve state-of-the-art performance. To our knowledge this is the first larger-scale evaluation of how knowledge graph-based representations can be systematically incorporated into the process of fake news classification.

Submitted 15 February, 2022; v1 submitted 20 October, 2021; originally announced October 2021.

arXiv:2107.10614 [pdf, ps, other]

Evaluation of contextual embeddings on less-resourced languages

Authors: Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, Marko Robnik-Šikonja

Abstract: The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.

Submitted 22 July, 2021; originally announced July 2021.

Comments: 45 pages

arXiv:2106.15986 [pdf, other]

doi 10.1007/s00521-022-07164-x

Cross-lingual alignments of ELMo contextual embeddings

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for less-resourced languages. Cross-lingual embeddings map word embeddings from a less-resourced language to a resource-rich language so that a prediction model trained on data from the resource-rich language can also be used in the less-resourced language. To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context. We address this issue with a novel method for creating cross-lingual contextual alignment datasets. Based on that, we propose several cross-lingual mapping methods for ELMo embeddings. The proposed linear mapping methods use existing Vecmap and MUSE alignments on contextual ELMo embeddings. Novel nonlinear ELMoGAN mapping methods are based on GANs and do not assume isomorphic embedding spaces. We evaluate the proposed mapping methods on nine languages, using four downstream tasks: named entity recognition (NER), dependency parsing (DP), terminology alignment, and sentiment analysis. The ELMoGAN methods perform very well on the NER and terminology alignment tasks, with a lower cross-lingual loss for NER compared to the direct training on some languages. In DP and sentiment analysis, linear contextual alignment variants are more successful.

Submitted 22 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: 30 pages, 5 figures

Journal ref: Neural Computing and Applications, 2022

arXiv:2012.04307 [pdf]

Cross-lingual Transfer of Abstractive Summarizer to Less-resource Language

Authors: Aleš Žagar, Marko Robnik-Šikonja

Abstract: Automatic text summarization extracts important information from texts and presents the information in the form of a summary. Abstractive summarization approaches progressed significantly by switching to deep neural networks, but results are not yet satisfactory, especially for languages where large training sets do not exist. In several natural language processing tasks, a cross-lingual model transfer is successfully applied in less-resource languages. For summarization, the cross-lingual model transfer was not attempted due to a non-reusable decoder side of neural models that cannot correct target language generation. In our work, we use a pre-trained English summarization model based on deep neural networks and sequence-to-sequence architecture to summarize Slovene news articles. We address the problem of inadequate decoder by using an additional language model for the evaluation of the generated text in target language. We test several cross-lingual summarization models with different amounts of target data for fine-tuning. We assess the models with automatic evaluation measures and conduct a small-scale human evaluation. Automatic evaluation shows that the summaries of our best cross-lingual model are useful and of quality similar to the model trained only in the target language. Human evaluation shows that our best model generates summaries with high accuracy and acceptable readability. However, similar to other abstractive models, our models are not perfect and may occasionally produce misleading or absurd content.

Submitted 2 September, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

arXiv:2011.12432 [pdf, other]

doi 10.1017/S1351324922000080

Enhancing deep neural networks with morphological information

Authors: Matej Klemen, Luka Krsnik, Marko Robnik-Šikonja

Abstract: Deep learning approaches are superior in NLP due to their ability to extract informative features and patterns from languages. The two most successful neural architectures are LSTM and transformers, used in large pretrained language models such as BERT. While cross-lingual approaches are on the rise, most current NLP techniques are designed and applied to English, and less-resourced languages are lagging behind. In morphologically rich languages, information is conveyed through morphology, e.g., through affixes modifying stems of words. Existing neural approaches do not explicitly use the information on word morphology. We analyse the effect of adding morphological features to LSTM and BERT models. As a testbed, we use three tasks available in many less-resourced languages: named entity recognition (NER), dependency parsing (DP), and comment filtering (CF). We construct baselines involving LSTM and BERT models, which we adjust by adding additional input in the form of part of speech (POS) tags and universal features. We compare models across several languages from different language families. Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task. The features improve the performance of LSTM-based models on the NER and DP tasks, while they do not benefit the performance on the CF task. For BERT-based models, the morphological features only improve the performance on DP when they are of high quality while not showing practical improvement when they are predicted. Even for high-quality features, the improvements are less pronounced in language-specific BERT variants compared to massively multilingual BERT models. As in NER and CF datasets manually checked features are not available, we only experiment with predicted features and find that they do not cause any practical improvement in performance.

Submitted 1 March, 2022; v1 submitted 24 November, 2020; originally announced November 2020.

Comments: Updated version, accepted to Natural Language Engineering

arXiv:2010.14872 [pdf, other]

Bayesian Methods for Semi-supervised Text Annotation

Authors: Kristian Miok, Gregor Pirs, Marko Robnik-Sikonja

Abstract: Human annotations are an important source of information in the development of natural language understanding approaches. Because annotators under productivity pressure can assign different labels to a given text, the quality of produced annotations frequently varies. This is especially the case if decisions are difficult, carry a high cognitive load, or require awareness of broader context or careful consideration of background knowledge. To alleviate the problem, we propose two semi-supervised methods to guide the annotation process: a Bayesian deep learning model and a Bayesian ensemble method. Using a Bayesian deep learning method, we can discover annotations that cannot be trusted and might require reannotation. A recently proposed Bayesian ensemble method helps us to combine the annotators' labels with predictions of trained models. According to the results obtained from three hate speech detection experiments, the proposed Bayesian methods can improve the annotations and prediction performance of BERT models.

Submitted 28 October, 2020; originally announced October 2020.

Comments: Accepted for COLING 2020, The 14th Linguistic Annotation Workshop

arXiv:2008.05759 [pdf, other]

doi 10.1016/j.knosys.2021.107606

MICE: Mining Idioms with Contextual Embeddings

Authors: Tadej Škvorc, Polona Gantar, Marko Robnik-Šikonja

Abstract: Idiomatic expressions can be problematic for natural language processing applications as their meaning cannot be inferred from their constituting words. A lack of successful methodological approaches and sufficiently large datasets prevents the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for that purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform much better than existing approaches, and are capable of detecting idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of developed models and analyze the size of the required dataset.

Submitted 10 November, 2021; v1 submitted 13 August, 2020; originally announced August 2020.

arXiv:2006.07890 [pdf, ps, other]

FinEst BERT and CroSloEngual BERT: less is more in multilingual models

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. The research has been mostly focused on the English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks, NER, POS-tagging, and dependency parsing, using the multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.

Submitted 14 June, 2020; originally announced June 2020.

Comments: 10 pages, accepted at TSD 2020 conference

Journal ref: Proceedings of the 23rd International Conference on Text, Speech, and Dialogue (TSD 2020), pages 104-111

arXiv:2006.04410 [pdf, other]

doi 10.1007/s10994-020-05890-8

Propositionalization and Embeddings: Two Sides of the Same Coin

Authors: Nada Lavrač, Blaž Škrlj, Marko Robnik-Šikonja

Abstract: Data preprocessing is an important component of machine learning pipelines, which requires ample time and resources. An integral part of preprocessing is data transformation into the format required by a given learning algorithm. This paper outlines some of the modern data processing techniques used in relational learning that enable data fusion from different input data types and formats into a single table data representation, focusing on the propositionalization and embedding data transformation approaches. While both approaches aim at transforming data into tabular data format, they use different terminology and task definitions, are perceived to address different goals, and are used in different contexts. This paper contributes a unifying framework that allows for improved understanding of these two data transformation techniques by presenting their unified definitions, and by explaining the similarities and differences between the two approaches as variants of a unified complex data transformation task. In addition to the unifying framework, the novelty of this paper is a unifying methodology combining propositionalization and embeddings, which benefits from the advantages of both in solving complex data transformation and learning tasks. We present two efficient implementations of the unifying methodology: an instance-based PropDRM approach, and a feature-based PropStar approach to data transformation and learning, together with their empirical evaluation on several relational problems. The results show that the new algorithms can outperform existing relational learners and can solve much larger problems.

Submitted 8 June, 2020; originally announced June 2020.

Comments: Accepted in MLJ

arXiv:2005.07456 [pdf]

Cross-lingual Transfer of Sentiment Classifiers

Authors: Marko Robnik-Sikonja, Kristjan Reba, Igor Mozetic

Abstract: Word embeddings represent words in a numeric space so that semantic relations between words are represented as distances and directions in the vector space. Cross-lingual word embeddings transform vector spaces of different languages so that similar words are aligned. This is done by constructing a mapping between vector spaces of two languages or learning a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that recently show superior transfer performance. The first mechanism uses the trained models whose input is the joint numerical space for many languages as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that the transfer of models between similar languages is sensible, even with no target language data. The performance of cross-lingual models obtained with the multilingual BERT and LASER library is comparable, and the differences are language-dependent. The transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these and some closely related languages.

Submitted 24 March, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: 18 pages, 8 tables

MSC Class: 68T50 (Primary) ACM Class: I.2.7; J.4; K.4.2

arXiv:2005.06173 [pdf]

Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders

Authors: Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie

Abstract: Due to complex experimental settings, missing values are common in biomedical data. To handle this issue, many methods have been proposed, from ignoring incomplete instances to various data imputation approaches. With the recent rise of deep neural networks, the field of missing data imputation has oriented towards modelling of the data distribution. This paper presents an approach based on Monte Carlo dropout within (Variational) Autoencoders which offers not only very good adaptation to the distribution of the data but also allows generation of new data, adapted to each specific instance. The evaluation shows that the imputation error and predictive similarity can be improved with the proposed approach.

Submitted 13 May, 2020; originally announced May 2020.

arXiv:2005.05716 [pdf, other]

AttViz: Online exploration of self-attention for transparent neural language modeling

Authors: Blaž Škrlj, Nika Eržen, Shane Sheehan, Saturnino Luz, Marko Robnik-Šikonja, Senja Pollak

Abstract: Neural language models are becoming the prevailing methodology for the tasks of query answering, text classification, disambiguation, completion and translation. Commonly comprised of hundreds of millions of parameters, these neural network models offer state-of-the-art performance at the cost of interpretability; humans are no longer capable of tracing and understanding how decisions are being made. The attention mechanism, introduced initially for the task of translation, has been successfully adopted for other language-related tasks. We propose AttViz, an online toolkit for exploration of self-attention---real values associated with individual text tokens. We show how existing deep learning pipelines can produce outputs suitable for AttViz, offering novel visualizations of the attention heads and their aggregations with minimal effort, online. We show on examples of news segments how the proposed system can be used to inspect and potentially better understand what a model has learned (or emphasized).

Submitted 12 May, 2020; originally announced May 2020.

arXiv:1912.05320 [pdf, other]

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context

Authors: Carlos Santos Armendariz, Matthew Purver, Matej Ulčar, Senja Pollak, Nikola Ljubešić, Marko Robnik-Šikonja, Mark Granroth-Wilding, Kristiina Vaik

Abstract: State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

Submitted 29 October, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

ACM Class: I.2.7

Journal ref: Proceedings of the 12th Language Resources and Evaluation Conference (2020) 5878-5886

arXiv:1911.10049 [pdf, other]

High Quality ELMo Embeddings for Seven Less-Resourced Languages

Authors: Matej Ulčar, Marko Robnik-Šikonja

Abstract: Recent results show that deep neural networks using contextual embeddings significantly outperform non-contextual embeddings on a majority of text classification tasks. We offer precomputed embeddings from the popular contextual ELMo model for seven languages: Croatian, Estonian, Finnish, Latvian, Lithuanian, Slovenian, and Swedish. We demonstrate that the quality of embeddings strongly depends on the size of the training set and show that existing publicly available ELMo embeddings for the listed languages can be improved. We train new ELMo embeddings on much larger training sets and show their advantage over baseline non-contextual FastText embeddings. In evaluation, we use two benchmarks, the analogy task and the NER task.

Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 8 pages, 3 figures, LREC2020 conference

Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4731-4738

arXiv:1911.10038 [pdf, ps, other]

Multilingual Culture-Independent Word Analogy Datasets

Authors: Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, Marko Robnik-Šikonja

Abstract: In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We redesigned the original monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.

Submitted 27 March, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: 7 pages, LREC2020 conference

ACM Class: J.5

Journal ref: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4074-4080

arXiv:1909.07158 [pdf, other]

doi 10.1007/978-3-030-31372-2_24

Prediction Uncertainty Estimation for Hate Speech Classification

Authors: Kristian Miok, Dong Nguyen-Doan, Blaž Škrlj, Daniela Zaharie, Marko Robnik-Šikonja

Abstract: As a result of social network popularity, hate speech has increased significantly in recent years. Due to its harmful effect on minority groups as well as on large communities, there is a pressing need for hate speech detection and filtering. However, automatic approaches should not jeopardize free speech, so they should accompany their decisions with explanations and assessment of uncertainty. Thus, there is a need for predictive machine learning models that not only detect hate speech but also help users understand when texts cross the line and become unacceptable. The reliability of predictions is usually not addressed in text classification. We fill this gap by proposing the adaptation of deep neural networks that can efficiently estimate prediction uncertainty. To reliably detect hate speech, we use Monte Carlo dropout regularization, which mimics Bayesian inference within neural networks. We evaluate our approach using different text embedding methods. We visualize the reliability of results with a novel technique that aids in understanding the classification reliability and errors.

Submitted 12 December, 2019; v1 submitted 16 September, 2019; originally announced September 2019.

Comments: The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-31372-2_24

Journal ref: Statistical Language and Speech Processing 2019 Proceedings

arXiv:1909.05755 [pdf, other]

Generating Data using Monte Carlo Dropout

Authors: Kristian Miok, Dong Nguyen-Doan, Daniela Zaharie, Marko Robnik-Šikonja

Abstract: For many analytical problems the challenge is to handle huge amounts of available data. However, there are data science application areas where collecting information is difficult and costly, e.g., in the study of geological phenomena, rare diseases, faults in complex systems, insurance frauds, etc. In many such cases, generators of synthetic data with the same statistical and predictive properties as the actual data allow efficient simulations and development of tools and applications. In this work, we propose the incorporation of Monte Carlo Dropout method within Autoencoder (MCD-AE) and Variational Autoencoder (MCD-VAE) as efficient generators of synthetic data sets. As the Variational Autoencoder (VAE) is one of the most popular generator techniques, we explore its similarities and differences to the proposed methods. We compare the generated data sets with the original data based on statistical properties, structural similarity, and predictive similarity. The results obtained show a strong similarity between the results of VAE, MCD-VAE and MCD-AE; however, the proposed methods are faster and can generate values similar to specific selected initial instances.

Submitted 16 September, 2019; v1 submitted 12 September, 2019; originally announced September 2019.

arXiv:1908.04070 [pdf]

doi 10.1007/s12599-019-00612-4

Exploring the relations between net benefits of IT projects and CIOs' perception of quality of software development disciplines

Authors: Damjan Vavpotič, Marko Robnik-Šikonja, Tomaž Hovelja

Abstract: Software development enterprises are under consistent pressure to improve their management techniques and development processes. These are comprised of several disciplines like requirements acquisition, design, coding, testing, etc. that must be continuously improved and individually tailored to suit a specific software development project. This paper presents an evaluation approach that enables enterprises to increase development process net benefits by improving disciplines' quality and increasing developers' satisfaction. Our approach builds on Kano's model of quality. Based on an empirical study of the top 1000 enterprises from Slovenia, we find that application of software development methodologies in individual development disciplines significantly relates to net benefits of IT projects. The results show that different types of Kano quality are present in individual disciplines. Enterprises should be cautious when altering must-be quality disciplines like testing or deployment as they can significantly disrupt the established routines, cause great dissatisfaction among developers and significantly reduce benefits. On the other hand, changing the attractive quality disciplines like requirements acquisition can notably increase developers' satisfaction and benefits but is less likely to disrupt the established routines.

Submitted 12 August, 2019; originally announced August 2019.

MSC Class: 68N99 ACM Class: D.2.9

Journal ref: Business & Information Systems Engineering, 2019

arXiv:1907.11779 [pdf, other]

doi 10.1162/coli_a_00398

Supervised and Unsupervised Neural Approaches to Text Readability

Authors: Matej Martinc, Senja Pollak, Marko Robnik-Šikonja

Abstract: We present a set of novel neural supervised and unsupervised approaches for determining the readability of documents. In the unsupervised setting, we leverage neural language models, whereas in the supervised setting, three different neural classification architectures are tested. We show that the proposed neural unsupervised approach is robust, transferable across languages and allows adaptation to a specific readability task and data set. By systematic comparison of several neural architectures on a number of benchmark and new labelled readability datasets in two languages, this study also offers a comprehensive analysis of different neural approaches to readability classification. We expose their strengths and weaknesses, compare their performance to current state-of-the-art classification approaches to readability, which in most cases still rely on extensive feature engineering, and propose possibilities for improvements.

Submitted 11 March, 2021; v1 submitted 26 July, 2019; originally announced July 2019.

Comments: 39 pages, published in the Computational Linguistics journal

arXiv:1902.03964 [pdf, other]

doi 10.1002/int.22651

Deep Node Ranking for Neuro-symbolic Structural Node Embedding and Classification

Authors: Blaž Škrlj, Jan Kralj, Janez Konc, Marko Robnik-Šikonja, Nada Lavrač

Abstract: Network node embedding is an active research subfield of complex network analysis. This paper contributes a novel approach to learning network node embeddings and direct node classification using a node ranking scheme coupled with an autoencoder-based neural network architecture. The main advantages of the proposed Deep Node Ranking (DNR) algorithm are competitive or better classification performance, significantly higher learning speed and lower space requirements when compared to state-of-the-art approaches on 15 real-life node classification benchmarks. Furthermore, it enables exploration of the relationship between symbolic and the derived sub-symbolic node representations, offering insights into the learned node space structure. To avoid the space complexity bottleneck in a direct node classification setting, DNR computes stationary distributions of personalized random walks from given nodes in mini-batches, scaling seamlessly to larger networks. The scaling laws associated with DNR were also investigated on 1488 synthetic Erdős-Rényi networks, demonstrating its scalability to tens of millions of links.

Submitted 30 August, 2021; v1 submitted 11 February, 2019; originally announced February 2019.

Comments: Accepted for publication in IJIS

arXiv:1406.4287 [pdf]

Identifying roles of clinical pharmacy with survey evaluation

Authors: Andreja Čufar, Aleš Mrhar, Marko Robnik-Šikonja

Abstract: Survey data sets are important sources of data, and their successful exploitation is of key importance for informed policy decision making. We present how a survey analysis approach initially developed for customer satisfaction research in marketing can be adapted for the introduction of clinical pharmacy services into a hospital. We use two analytical approaches to extract relevant managerial consequences. With the OrdEval algorithm we first evaluate the importance of competences for the users of clinical pharmacy and extract their nature according to the users' expectations. Next, we build a model for predicting a successful introduction of clinical pharmacy to the clinical departments. We identify the wards with the highest probability of successful cooperation with a clinical pharmacist. We obtain useful managerially relevant information from a relatively small sample of highly relevant respondents. We show how the OrdEval algorithm exploits the information hidden in the ordering of class and attribute values and their inherent correlation. Its output can be effectively visualized and complemented with confidence intervals.

Submitted 17 June, 2014; originally announced June 2014.

MSC Class: 68T37 ACM Class: I.2.1; I.2.6

arXiv:1403.7308 [pdf, ps, other]

doi 10.1109/TNNLS.2015.2429711

Data Generators for Learning Systems Based on RBF Networks

Authors: Marko Robnik-Šikonja

Abstract: There are plenty of problems where the data available is scarce and expensive. We propose a generator of semi-artificial data with similar properties to the original data which enables development and testing of different data mining algorithms and optimization of their parameters. The generated data allow a large scale experimentation and simulations without danger of overfitting. The proposed generator is based on RBF networks, which learn sets of Gaussian kernels. These Gaussian kernels can be used in a generative mode to generate new data from the same distributions. To assess quality of the generated data we evaluated the statistical properties of the generated data, structural similarity and predictive similarity using supervised and unsupervised learning techniques. To determine usability of the proposed generator we conducted a large scale evaluation using 51 UCI data sets. The results show a considerable similarity between the original and generated data and indicate that the method can be useful in several development and simulation scenarios. We analyze possible improvements in classification performance by adding different amounts of generated data to the training set, performance on high dimensional data sets, and conditions when the proposed approach is successful.

Submitted 19 July, 2020; v1 submitted 28 March, 2014; originally announced March 2014.

MSC Class: 62-07; 62H30; 97N80; 65C10 ACM Class: I.2.6; I.5.2; I.6.5; G.3; G.4

Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 27(5):926-938, 2016

Showing 1–33 of 33 results for author: Robnik-Šikonja, M