Search | arXiv e-print repository

Code-mixed Sentiment and Hate-speech Prediction

Authors: Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Abstract: Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant… ▽ More Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As recently large language models have dominated most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene languages, specifically aimed to support informal language. Then we performed an evaluation of monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text, in particular, sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data compared to non-code-mixed data. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2101.11702 [pdf, other]

Better sampling in explanation methods can prevent dieselgate-like deception

Authors: Domen Vreš, Marko Robnik Šikonja

Abstract: Machine learning models are used in many sensitive areas where besides predictive accuracy their comprehensibility is also important. Interpretability of prediction models is necessary to determine their biases and causes of errors, and is a necessary prerequisite for users' confidence. For complex state-of-the-art black-box models post-hoc model-independent explanation techniques are an establish… ▽ More Machine learning models are used in many sensitive areas where besides predictive accuracy their comprehensibility is also important. Interpretability of prediction models is necessary to determine their biases and causes of errors, and is a necessary prerequisite for users' confidence. For complex state-of-the-art black-box models post-hoc model-independent explanation techniques are an established solution. Popular and effective techniques, such as IME, LIME, and SHAP, use perturbation of instance features to explain individual predictions. Recently, Slack et al. (2020) put their robustness into question by showing that their outcomes can be manipulated due to poor perturbation sampling employed. This weakness would allow dieselgate type cheating of owners of sensitive models who could deceive inspection and hide potentially unethical or illegal biases existing in their predictive models. This could undermine public trust in machine learning models and give rise to legal restrictions on their use. We show that better sampling in these explanation methods prevents malicious manipulations. The proposed sampling uses data generators that learn the training set distribution and generate new perturbation instances much more similar to the training set. We show that the improved sampling increases the robustness of the LIME and SHAP, while previously untested method IME is already the most robust of all. △ Less

Submitted 26 January, 2021; originally announced January 2021.

arXiv:2009.03091 [pdf, other]

Iterative Correction of Sensor Degradation and a Bayesian Multi-Sensor Data Fusion Method

Authors: Luka Kolar, Rok Šikonja, Lenart Treven

Abstract: We present a novel method for inferring ground-truth signal from multiple degraded signals, affected by different amounts of sensor exposure. The algorithm learns a multiplicative degradation effect by performing iterative corrections of two signals solely from the ratio between them. The degradation function d should be continuous, satisfy monotonicity, and d(0) = 1. We use smoothed monotonic reg… ▽ More We present a novel method for inferring ground-truth signal from multiple degraded signals, affected by different amounts of sensor exposure. The algorithm learns a multiplicative degradation effect by performing iterative corrections of two signals solely from the ratio between them. The degradation function d should be continuous, satisfy monotonicity, and d(0) = 1. We use smoothed monotonic regression method, where we easily incorporate the aforementioned criteria to the fitting part. We include theoretical analysis and prove convergence to the ground-truth signal for the noiseless measurement model. Lastly, we present an approach to fuse the noisy corrected signals using Gaussian processes. We use sparse Gaussian processes that can be utilized for a large number of measurements together with a specialized kernel that enables the estimation of noise values of all sensors. The data fusion framework naturally handles data gaps and provides a simple and powerful method for observing the signal trends on multiple timescales(long-term and short-term signal properties). The viability of correction method is evaluated on a synthetic dataset with known ground-truth signal. △ Less

Submitted 7 September, 2020; originally announced September 2020.

Showing 1–3 of 3 results for author: Šikonja, R