Search | arXiv e-print repository

RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian

Authors: Adrian Cosma, Bogdan Iordache, Paolo Rosso

Abstract: Recently, large language models (LLMs) have become increasingly powerful and have become capable of solving a plethora of tasks through proper instructions in natural language. However, the vast majority of testing suites assume that the instructions are written in English, the de facto prompting language. Code intelligence and problem solving still remain a difficult task, even for the most advanced LLMs. Currently, there are no datasets to measure the generalization power for code-generation models in a language other than English. In this work, we present RoCode, a competitive programming dataset, consisting of 2,642 problems written in Romanian, 11k solutions in C, C++ and Python and comprehensive testing suites for each problem. The purpose of RoCode is to provide a benchmark for evaluating the code intelligence of language models trained on Romanian / multilingual text as well as a fine-tuning set for pretrained Romanian models. Through our results and review of related works, we argue for the need to develop code models for languages other than English.

Submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted at LREC-COLING 2024

arXiv:2402.01235 [pdf, other]

QSpeckleFilter: a Quantum Machine Learning approach for SAR speckle filtering

Authors: Francesco Mauro, Alessandro Sebastianelli, Maria Pia Del Rosso, Paolo Gamba, Silvia Liberata Ullo

Abstract: The use of Synthetic Aperture Radar (SAR) has greatly advanced our capacity for comprehensive Earth monitoring, providing detailed insights into terrestrial surface use and cover regardless of weather conditions, and at any time of day or night. However, SAR imagery quality is often compromised by speckle, a granular disturbance that poses challenges in producing accurate results without suitable data processing. In this context, the present paper explores the cutting-edge application of Quantum Machine Learning (QML) in speckle filtering, harnessing quantum algorithms to address computational complexities. We introduce here QSpeckleFilter, a novel QML model for SAR speckle filtering. Compared with previous work by the same authors, the proposed method achieves superior Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) on a testing dataset, and it opens new avenues for Earth Observation (EO) applications.

Submitted 2 February, 2024; originally announced February 2024.

Comments: We have submitted this paper to IGARSS 2024

arXiv:2401.02746 [pdf, other]

Reading Between the Frames: Multi-Modal Depression Detection in Videos from Non-Verbal Cues

Authors: David Gimeno-Gómez, Ana-Maria Bucur, Adrian Cosma, Carlos-David Martínez-Hinarejos, Paolo Rosso

Abstract: Depression, a prominent contributor to global disability, affects a substantial portion of the population. Efforts to detect depression from social media texts have been prevalent, yet only a few works explored depression detection from user-generated video content. In this work, we address this research gap by proposing a simple and flexible multi-modal temporal model capable of discerning non-verbal depression cues from diverse modalities in noisy, real-world videos. We show that, for in-the-wild videos, using additional high-level non-verbal cues is crucial to achieving good performance, and we extracted and processed audio speech embeddings, face emotion embeddings, face, body and hand landmarks, and gaze and blinking information. Through extensive experiments, we show that our model achieves state-of-the-art results on three key benchmark datasets for depression detection from video by a substantial margin. Our code is publicly available on GitHub.

Submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted at 46th European Conference on Information Retrieval (ECIR 2024)

arXiv:2312.07228 [pdf]

Toxic language detection: a systematic review of Arabic datasets

Authors: Imene Bensalem, Paolo Rosso, Hanane Zitouni

Abstract: The detection of toxic language in the Arabic language has emerged as an active area of research in recent years, and reviewing the existing datasets employed for training the developed solutions has become a pressing need. This paper offers a comprehensive survey of Arabic datasets focused on online toxic language. We systematically gathered a total of 54 available datasets and their corresponding papers and conducted a thorough analysis, considering 18 criteria across four primary dimensions: availability details, content, annotation process, and reusability. This analysis enabled us to identify existing gaps and make recommendations for future research works. For the convenience of the research community, the list of the analysed datasets is maintained in a GitHub repository (https://github.com/Imene1/Arabic-toxic-language).

Submitted 29 January, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.02025 [pdf, other]

Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

Authors: Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto

Abstract: Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method which interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that, consistently across all target languages, MIXAG improves significantly in multidomain and multilingual environments. Finally, we show through an error analysis how the domain adaptation can favour the class of abusive texts (reducing false negatives) while, at the same time, reducing the precision of the abusive language detection model.

Submitted 3 November, 2023; originally announced November 2023.

Comments: Accepted at EMNLP 2023 (Main Conference)

arXiv:2309.11285 [pdf, other]

Overview of AuTexTification at IberLEF 2023: Detection and Attribution of Machine-Generated Text in Multiple Domains

Authors: Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador, Francisco Rangel, Berta Chulvi, Paolo Rosso

Abstract: This paper presents the overview of the AuTexTification shared task as part of the IberLEF 2023 Workshop in Iberian Languages Evaluation Forum, within the framework of the SEPLN 2023 conference. AuTexTification consists of two subtasks: for Subtask 1, participants had to determine whether a text is human-authored or has been generated by a large language model. For Subtask 2, participants had to attribute a machine-generated text to one of six different text generation models. Our AuTexTification 2023 dataset contains more than 160,000 texts across two languages (English and Spanish) and five domains (tweets, reviews, news, legal, and how-to articles). A total of 114 teams signed up to participate, of which 36 sent 175 runs, and 20 of them sent their working notes. In this overview, we present the AuTexTification dataset and task, the submitted participating systems, and the results.

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at SEPLN 2023

Journal ref: Procesamiento del Lenguaje Natural, [S.l.], v. 71, p. 275-288, sep. 2023

arXiv:2307.03377 [pdf, ps, other]

Mitigating Negative Transfer with Task Awareness for Sexism, Hate Speech, and Toxic Language Detection

Authors: Angel Felipe Magnossão de Paula, Paolo Rosso, Damiano Spina

Abstract: This paper proposes a novel approach to mitigate the negative transfer problem. In the field of machine learning, the common strategy is to apply the Single-Task Learning approach in order to train a supervised model to solve a specific task. Training a robust model requires a lot of data and a significant amount of computational resources, making this solution unfeasible in cases where data are unavailable or expensive to gather. Therefore another solution, based on the sharing of information between tasks, has been developed: Multi-Task Learning (MTL). Despite the recent developments regarding MTL, the problem of negative transfer has yet to be solved. Negative transfer is a phenomenon that occurs when noisy information is shared between tasks, resulting in a drop in performance. This paper proposes a new approach to mitigate the negative transfer problem based on the task awareness concept. The proposed approach diminishes negative transfer and improves performance over the classic MTL solution. Moreover, the proposed approach has been implemented in two unified architectures to detect Sexism, Hate Speech, and Toxic Language in text comments. The proposed architectures set a new state-of-the-art on both the EXIST-2021 and HatEval-2019 benchmarks.

Submitted 7 July, 2023; originally announced July 2023.

Comments: 8 pages, 2 figures, 5 tables, IJCNN 2023 conference

arXiv:2303.09823 [pdf, other]

Transformers and Ensemble methods: A solution for Hate Speech Detection in Arabic languages

Authors: Angel Felipe Magnossão de Paula, Imene Bensalem, Paolo Rosso, Wajdi Zaghouani

Abstract: This paper describes our participation in the shared task of hate speech detection, which is one of the subtasks of the CERIST NLP Challenge 2022. Our experiments evaluate the performance of six transformer models and their combination using 2 ensemble approaches. The best results on the training set, in a five-fold cross validation scenario, were obtained by using the ensemble approach based on the majority vote. The evaluation of this approach on the test set resulted in an F1-score of 0.60 and an Accuracy of 0.86.

Submitted 17 March, 2023; originally announced March 2023.

Comments: 7 pages, 3 tables
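
The majority-vote ensemble mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: it assumes hard voting over per-model class labels, and the function name `majority_vote` and the tie-breaking rule (prefer the earliest model's label) are hypothetical choices, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists by hard majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length (one label per example).
    Ties go to the label predicted by the earliest model in the list
    (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")

    voted = []
    for labels in zip(*predictions):  # labels for one example, across models
        counts = Counter(labels)
        top = max(counts.values())
        # First label (in model order) that reaches the top count wins.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Three hypothetical models labelling three comments (1 = hate speech).
model_outputs = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]
ensemble_labels = majority_vote(model_outputs)
```

Hard voting only needs the predicted labels, so it works even when the underlying models do not expose comparable probabilities.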

arXiv:2301.05494 [pdf, other]

Multilingual Detection of Check-Worthy Claims using World Languages and Adapter Fusion

Authors: Ipek Baris Schlicht, Lucie Flek, Paolo Rosso

Abstract: Check-worthiness detection is the task of identifying claims worthy of investigation by fact-checkers. Resource scarcity for non-world languages and model learning costs remain major challenges for the creation of models supporting multilingual check-worthiness detection. This paper proposes cross-training adapters on a subset of world languages, combined by adapter fusion, to detect claims emerging globally in multiple languages. (1) With a vast number of annotators available for world languages and the storage-efficient adapter models, this approach is more cost efficient. Models can be updated more frequently and thus stay up-to-date. (2) Adapter fusion provides insights and allows for interpretation regarding the influence of each adapter model on a particular language. The proposed solution often outperformed the top multilingual approaches in our benchmark tasks.

Submitted 13 January, 2023; originally announced January 2023.

Comments: 17 pages, 11 tables. It has been accepted as a full paper at ECIR 2023

arXiv:2301.05453 [pdf, other]

It's Just a Matter of Time: Detecting Depression with Time-Enriched Multimodal Transformers

Authors: Ana-Maria Bucur, Adrian Cosma, Paolo Rosso, Liviu P. Dinu

Abstract: Depression detection from user-generated content on the internet has been a long-lasting topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquitous use of social media platforms lays out the perfect avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models for extracting image and text embeddings. Our model operates directly at the user-level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant, which can operate on randomly sampled and unordered sets of posts to be more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset, and 0.902 F1 score on the only multimodal Reddit dataset.

Submitted 6 February, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

Comments: Accepted at ECIR 2023

arXiv:2212.02352 [pdf, ps, other]

Fake News and Hate Speech: Language in Common

Authors: Berta Chulvi, Alejandro Toselli, Paolo Rosso

Abstract: In this paper we raise the research question of whether fake news and hate speech spreaders share common patterns in language. We compute a novel index, the ingroup vs outgroup index, in three different datasets and we show that both phenomena share an "us vs them" narrative.

Submitted 5 December, 2022; originally announced December 2022.

Comments: 2 pages

arXiv:2207.12406 [pdf, ps, other]

UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu

Authors: Maaz Amjad, Grigori Sidorov, Alisa Zhila, Alexander Gelbukh, Paolo Rosso

Abstract: This paper gives the overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language. This is a binary classification task in which the goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing. The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. 42 teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task. 9 teams submitted their experimental results. The participants used various machine learning methods ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score value of 0.90, showing that the BERT-based approach outperforms other machine learning classifiers.

Submitted 24 July, 2022; originally announced July 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2207.11893

arXiv:2207.11893 [pdf, other]

Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2020

Authors: Maaz Amjad, Grigori Sidorov, Alisa Zhila, Alexander Gelbukh, Paolo Rosso

Abstract: This overview paper describes the first shared task on fake news detection in the Urdu language. The task was posed as a binary classification task, in which the goal is to differentiate between real and fake news. We provided a dataset divided into 900 annotated news articles for training and 400 news articles for testing. The dataset contained news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business. 42 teams from 6 different countries (India, China, Egypt, Germany, Pakistan, and the UK) registered for the task. 9 teams submitted their experimental results. The participants used various machine learning methods ranging from feature-based traditional machine learning to neural network techniques. The best performing system achieved an F-score value of 0.90, showing that the BERT-based approach outperforms other machine learning techniques.

Submitted 24 July, 2022; originally announced July 2022.

arXiv:2207.05677 [pdf, other]

doi 10.1145/3547276.3548444

The OpenMP Cluster Programming Model

Authors: Hervé Yviquel, Marcio Pereira, Emílio Francesquini, Guilherme Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, Sandro Rigo, Alan Souza, Guido Araujo

Abstract: Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.

Submitted 13 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: 12 pages, 7 figures, 1 listing, to be published in the 51st International Conference on Parallel Processing Workshop Proceedings (ICPP Workshops 22)

ACM Class: D.4.1; D.3.2

arXiv:2207.00753 [pdf, other]

An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder

Authors: Ana-Maria Bucur, Adrian Cosma, Liviu P. Dinu, Paolo Rosso

Abstract: This work proposes a transformer architecture for user-level classification of gambling addiction and depression that is trainable end-to-end. As opposed to other methods that operate at the post level, we process a set of social media posts from a particular individual, to make use of the interactions between posts and eliminate label noise at the post level. We exploit the fact that, by not injecting positional encodings, multi-head attention is permutation invariant and we process randomly sampled sets of texts from a user after being encoded with a modern pretrained sentence encoder (RoBERTa / MiniLM). Moreover, our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation by identifying discriminating posts in a user's text-set. We perform ablation studies on hyper-parameters and evaluate our method for the eRisk 2022 Lab on early detection of signs of pathological gambling and early risk detection of depression. The method proposed by our team BLUE obtained the best ERDE5 score of 0.015, and the second-best ERDE50 score of 0.009 for pathological gambling detection. For the early detection of depression, we obtained the second-best ERDE50 of 0.027.

Submitted 2 July, 2022; originally announced July 2022.
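
The permutation property that the abstract above relies on can be checked with a small sketch. This is not the paper's architecture: it assumes a single attention head with identity query/key/value projections and mean pooling over the outputs, and the function names are hypothetical. Strictly, self-attention without positional encodings is permutation equivariant (reordering the inputs reorders the outputs in the same way); pooling over the set then gives a representation that does not depend on post order.

```python
import math


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Uses identity query/key/value projections (a simplification for this
    sketch) and no positional encodings. tokens: list of equal-length
    float vectors.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def pooled_representation(tokens):
    """Mean-pool the attention outputs into one user-level vector."""
    attended = self_attention(tokens)
    dim = len(attended[0])
    return [sum(row[d] for row in attended) / len(attended) for d in range(dim)]


# Toy "post embeddings" for one user, in two different orders.
posts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
shuffled = [posts[2], posts[0], posts[1]]

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pooled_representation(posts), pooled_representation(shuffled))
)
```

Here `same` compares the pooled vectors up to floating-point tolerance, because summing in a different order can change the last few bits of the result.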

arXiv:2206.06320 [pdf, other]

Cryptocurrency Bubble Detection: A New Stock Market Dataset, Financial Task & Hyperbolic Models

Authors: Ramit Sawhney, Shivam Agarwal, Vivek Mittal, Paolo Rosso, Vikram Nanda, Sudheer Chava

Abstract: The rapid spread of information over social media influences quantitative trading and investments. The growing popularity of speculative trading of highly volatile assets such as cryptocurrencies and meme stocks presents a fresh challenge in the financial realm. Investigating such "bubbles", periods of sudden anomalous market behavior, is critical to better understanding investor behavior and market dynamics. However, high volatility coupled with massive volumes of chaotic social media texts, especially for underexplored assets like cryptocoins, poses a challenge to existing methods. Taking the first step towards NLP for cryptocoins, we present and publicly release CryptoBubbles, a novel multi-span identification task for bubble detection, and a dataset of more than 400 cryptocoins from 9 exchanges over five years spanning over two million tweets. Further, we develop a set of sequence-to-sequence hyperbolic models suited to this multi-span identification task based on the power-law dynamics of cryptocurrencies and user behavior on social media. We further test the effectiveness of our models under zero-shot settings on a test set of Reddit posts pertaining to 29 "meme stocks", which see an increase in trade volume due to social media hype. Through quantitative, qualitative, and zero-shot analyses on Reddit and Twitter spanning cryptocoins and meme-stocks, we show the practical applicability of CryptoBubbles and hyperbolic models.

Submitted 11 May, 2022; originally announced June 2022.

Comments: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

arXiv:2205.06181 [pdf, other]

FACTOID: A New Dataset for Identifying Misinformation Spreaders and Political Bias

Authors: Flora Sakketou, Joan Plepi, Riccardo Cervero, Henri-Jacques Geiss, Paolo Rosso, Lucie Flek

Abstract: Proactively identifying misinformation spreaders is an important step towards mitigating the impact of fake news on our society. In this paper, we introduce a new contemporary Reddit dataset for fake news spreader analysis, called FACTOID, monitoring political discussions on Reddit since the beginning of 2020. The dataset contains over 4K users with 3.4M Reddit posts, and includes, beyond the users' binary labels, also their fine-grained credibility level (very low to very high) and their political bias strength (extreme right to extreme left). As far as we are aware, this is the first fake news spreader dataset that simultaneously captures both the long-term context of users' historical posts and the interactions between them. To create the first benchmark on our data, we provide methods for identifying misinformation spreaders by utilizing the social connections between the users along with their psycho-linguistic features. We show that the users' social interactions can, on their own, indicate misinformation spreading, while the psycho-linguistic features are mostly informative in non-neural classification settings. In a qualitative analysis, we observe that detecting affective mental processes correlates negatively with right-biased users, and that the openness to experience factor is lower for those who spread fake news.

Submitted 11 May, 2022; originally announced May 2022.

Comments: Accepted to LREC 2022

arXiv:2204.10841 [pdf, other]

Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios

Authors: Petr Lorenc, Ana-Sabina Uban, Paolo Rosso, Jan Šedivý

Abstract: The high prevalence of depression in society has given rise to the need for new digital tools to assist in its early detection. To this end, existing research has mainly focused on detecting depression in the domain of social media, where there is a sufficient amount of data. However, with the rise of conversational agents like Siri or Alexa, the conversational domain is becoming more critical. Unfortunately, there is a lack of data in the conversational domain. We perform a study focusing on domain adaptation from social media to the conversational domain. Our approach mainly exploits the linguistic information preserved in the vector representation of text. We describe transfer learning techniques to classify users who suffer from early signs of depression with high recall. We achieve state-of-the-art results on a commonly used conversational dataset, and we highlight how the method can easily be used in conversational agents. We publicly release all source code.

Submitted 22 April, 2022; originally announced April 2022.

Comments: Accepted to The 27th International Conference on Natural Language & Information Systems (NLDB) 2022

arXiv:2204.09481 [pdf, other]

Unsupervised Ranking and Aggregation of Label Descriptions for Zero-Shot Classifiers

Authors: Angelo Basile, Marc Franco-Salvador, Paolo Rosso

Abstract: Zero-shot text classifiers based on label descriptions embed an input text and a set of labels into the same space: measures such as cosine similarity can then be used to select the most similar label description to the input text as the predicted label. In a true zero-shot setup, designing good label descriptions is challenging because no development set is available. Inspired by the literature on Learning with Disagreements, we look at how probabilistic models of repeated rating analysis can be used for selecting the best label descriptions in an unsupervised fashion. We evaluate our method on a set of diverse datasets and tasks (sentiment, topic and stance). Furthermore, we show that multiple, noisy label descriptions can be aggregated to boost the performance.

Submitted 24 May, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

Comments: 6 pages, 2 figures

MSC Class: I.2.7

arXiv:2112.06080 [pdf, other]

UPV at TREC Health Misinformation Track 2021 Ranking with SBERT and Quality Estimators

Authors: Ipek Baris Schlicht, Angel Felipe Magnossão de Paula, Paolo Rosso

Abstract: Health misinformation on search engines is a significant problem that could negatively affect individuals or public health. To mitigate the problem, TREC organizes a health misinformation track. This paper presents our submissions to this track. We use a BM25 and a domain-specific semantic search engine for retrieving initial documents. Later, we examine a health news schema for quality assessment and apply it to re-rank documents. We merge the scores from the different components by using reciprocal rank fusion. Finally, we discuss the results and conclude with future works.

Submitted 11 December, 2021; originally announced December 2021.

Comments: 6 pages; presented at the TREC 2021

arXiv:2109.09232 [pdf, other]

UPV at CheckThat! 2021: Mitigating Cultural Differences for Identifying Multilingual Check-worthy Claims

Authors: Ipek Baris Schlicht, Angel Felipe Magnossão de Paula, Paolo Rosso

Abstract: Identifying check-worthy claims is often the first step of automated fact-checking systems. Tackling this task in a multilingual setting has been understudied. Encoding inputs with multilingual text representations could be one approach to solve the multilingual check-worthiness detection. However, this approach could suffer if cultural bias exists within the communities on determining what is check-worthy. In this paper, we propose a language identification task as an auxiliary task to mitigate unintended bias. With this purpose, we experiment with joint training by using the datasets from CLEF-2021 CheckThat!, which contain tweets in English, Arabic, Bulgarian, Spanish and Turkish. Our results show that joint training of language identification and check-worthy claim detection tasks can provide performance gains for some of the selected languages.

Submitted 19 September, 2021; originally announced September 2021.

Comments: 11 pages, 2 figures. Link to the original paper: http://ceur-ws.org/Vol-2936/paper-36.pdf

ACM Class: I.7; J.4

Journal ref: published at CLEF 2021

arXiv:2109.07909 [pdf, other]

doi 10.1016/j.cosrev.2022.100531

Studying Fake News Spreading, Polarisation Dynamics, and Manipulation by Bots: a Tale of Networks and Language

Authors: Giancarlo Ruffo, Alfonso Semeraro, Anastasia Giachanou, Paolo Rosso

Abstract: With the explosive growth of online social media, the ancient problem of information disorders interfering with news diffusion has surfaced with a renewed intensity threatening our democracies, public health, and news outlets' credibility. Therefore, thousands of scientific papers have been published in a relatively short period, making researchers of different disciplines struggle with an information overload problem. The aim of this survey is threefold: (1) we present the results of a network-based analysis of the existing multidisciplinary literature to support the search for relevant trends and central publications; (2) we describe the main results and necessary background to attack the problem under a computational perspective; (3) we review selected contributions using network science as a unifying framework and computational linguistics as the tool to make sense of the shared content. Although scholars working on computational linguistics and networks traditionally belong to different scientific communities, we expect that those interested in the area of fake news should be aware of crucial aspects of both disciplines.

Submitted 14 January, 2023; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: 43 pages, 9 figures

ACM Class: A.1; J.4; G.2; K.4; I.2.7

Journal ref: Computer Science Review, Volume 47, 2023, 100531, ISSN 1574-0137

arXiv:2106.15281 [pdf, other]

On Board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery

Authors: Maria Pia Del Rosso, Alessandro Sebastianelli, Dario Spiller, Pierre Philippe Mathieu, Silvia Liberata Ullo

Abstract: In recent years, the growth of Machine Learning (ML) algorithms has raised the number of studies including their applicability in a variety of different scenarios. Among all, one of the hardest ones is the aerospace, due to its peculiar physical requirements. In this context, a feasibility study and a first prototype for an Artificial Intelligence (AI) model to be deployed on board satellites are presented in this work. As a case study, the detection of volcanic eruptions has been investigated as a method to swiftly produce alerts and allow immediate interventions. Two Convolutional Neural Networks (CNNs) have been proposed and designed, showing how to efficiently implement them for identifying the eruptions and at the same time adapting their complexity in order to fit on board requirements.

Submitted 28 July, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

arXiv:2106.12226 [pdf, other]

Spatio-Temporal SAR-Optical Data Fusion for Cloud Removal via a Deep Hierarchical Model

Authors: Alessandro Sebastianelli, Artur Nowakowski, Erika Puglisi, Maria Pia Del Rosso, Jamila Mifdal, Fiora Pirri, Pierre Philippe Mathieu, Silvia Liberata Ullo

Abstract: Cloud removal is a relevant topic in Remote Sensing as it fosters the usability of high-resolution optical images for Earth monitoring and study. Related techniques have been analyzed for years with a progressively clearer view of the appropriate methods to adopt, from multi-spectral to inpainting methods. Recent applications of deep generative models and sequence-to-sequence-based models have proved their capability to advance the field significantly. Nevertheless, there are still some gaps, mostly related to the amount of cloud coverage, the density and thickness of clouds, and the occurred temporal landscape changes. In this work, we fill some of these gaps by introducing a novel multi-modal method that uses different sources of information, both spatial and temporal, to restore the whole optical scene of interest. The proposed method introduces an innovative deep model, using the outcomes of both temporal-sequence blending and direct translation from Synthetic Aperture Radar (SAR) to optical images to obtain a pixel-wise restoration of the whole scene. The advantage of our approach is demonstrated across a variety of atmospheric conditions tested on a dataset we have generated and made available. Quantitative and qualitative results prove that the proposed method obtains cloud-free images, preserving scene details without resorting to a huge portion of a clean image and coping with landscape changes.

Submitted 28 March, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

arXiv:2106.11056 [pdf, other]

Paradigm selection for Data Fusion of SAR and Multispectral Sentinel data applied to Land-Cover Classification

Authors: Alessandro Sebastianelli, Maria Pia Del Rosso, Pierre Philippe Mathieu, Silvia Liberata Ullo

Abstract: Data fusion is a well-known technique, becoming more and more popular in the Artificial Intelligence for Earth Observation (AI4EO) domain mainly due to its ability of reinforcing AI4EO applications by combining multiple data sources and thus bringing better results. On the other hand, like other methods for satellite data analysis, data fusion itself is also benefiting and evolving thanks to the integration of Artificial Intelligence (AI). In this letter, four data fusion paradigms, based on Convolutional Neural Networks (CNNs), are analyzed and implemented. The goals are to provide a systematic procedure for choosing the best data fusion framework, resulting in the best classification results, once the basic structure for the CNN has been defined, and to help interested researchers in their work when data fusion applied to remote sensing is involved. The procedure has been validated for land-cover classification but it can be transferred to other cases.

Submitted 18 June, 2021; originally announced June 2021.

Comments: This work has been submitted to the IEEE Geoscience and Remote Sensing Letters for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2104.09350 [pdf, other]

A speckle filter for Sentinel-1 SAR Ground Range Detected data based on Residual Convolutional Neural Networks

Authors: Alessandro Sebastianelli, Maria Pia Del Rosso, Silvia Liberata Ullo, Paolo Gamba

Abstract: In recent years, machine learning (ML) algorithms have become widespread in all the fields of remote sensing (RS) and earth observation (EO). This has allowed the rapid development of new procedures to solve problems affecting these sectors. In this context, this work aims at presenting a novel method for filtering speckle noise from Sentinel-1 ground range detected (GRD) data by applying deep learning (DL) algorithms, based on convolutional neural networks (CNNs). The paper provides an easy yet very effective approach to extract the large amount of training data needed for DL approaches in this challenging case. The experimental results on simulated speckled images and an actual SAR dataset show a clear improvement with respect to the state of the art in terms of peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), equivalent number of looks (ENL), proving the effectiveness of the proposed architecture.

Submitted 17 May, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2101.09810 [pdf, other]

FakeFlow: Fake News Detection by Modeling the Flow of Affective Information

Authors: Bilal Ghanem, Simone Paolo Ponzetto, Paolo Rosso, Francisco Rangel

Abstract: Fake news articles often stir the readers' attention by means of emotional appeals that arouse their feelings. Unlike in short news texts, authors of longer articles can exploit such affective factors to manipulate readers by adding exaggerations or fabricating events, in order to affect the readers' emotions. To capture this, we propose in this paper to model the flow of affective information in fake news articles using a neural architecture. The proposed model, FakeFlow, learns this flow by combining topic and affective information extracted from text. We evaluate the model's performance with several experiments on four real-world datasets. The results show that FakeFlow achieves superior results when compared against state-of-the-art methods, thus confirming the importance of capturing the flow of the affective information in news articles.

Submitted 24 January, 2021; originally announced January 2021.

Comments: 9 pages, 6 figures, EACL-2021

arXiv:2101.07598 [pdf, other]

Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Authors: Sergei Koltcov, Vera Ignatenko, Maxim Terpilovskii, Paolo Rosso

Abstract: Hierarchical topic modeling is a potentially powerful instrument for determining the topical structure of text collections that allows constructing a topical hierarchy representing levels of topical abstraction. However, tuning of parameters of hierarchical models, including the number of topics on each hierarchical level, remains a challenging task and an open issue. In this paper, we propose a Renyi entropy-based approach for a partial solution to the above problem. First, we propose a Renyi entropy-based metric of quality for hierarchical models. Second, we propose a practical concept of hierarchical topic model tuning tested on datasets with human mark-up. In the numerical experiments, we consider three different hierarchical models, namely, hierarchical latent Dirichlet allocation (hLDA) model, hierarchical Pachinko allocation model (hPAM), and hierarchical additive regularization of topic models (hARTM). We demonstrate that hLDA model possesses a significant level of instability and, moreover, the derived numbers of topics are far away from the true numbers for labeled datasets. For hPAM model, the Renyi entropy approach allows us to determine only one level of the data structure. For hARTM model, the proposed approach allows us to estimate the number of topics for two hierarchical levels.

Submitted 19 January, 2021; originally announced January 2021.

arXiv:2011.05706 [pdf, ps, other]

Multilingual Irony Detection with Dependency Syntax and Neural Models

Authors: Alessandra Teresa Cignarella, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, Paolo Rosso, Farah Benamara

Abstract: This paper presents an in-depth investigation of the effectiveness of dependency-based syntactic features on the irony detection task in a multilingual perspective (English, Spanish, French and Italian). It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme. Three distinct experimental settings are provided. In the first, a variety of syntactic dependency-based features combined with classical machine learning classifiers are explored. In the second scenario, two well-known types of word embeddings are trained on parsed data and tested against gold standard datasets. In the third setting, dependency-based syntactic features are combined into the Multilingual BERT architecture. The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.

Submitted 11 November, 2020; originally announced November 2020.

Comments: long paper accepted at COLING 2020

arXiv:2008.13597 [pdf, ps, other]

doi 10.1007/s12046-019-1224-8

Classifier Combination Approach for Question Classification for Bengali Question Answering System

Authors: Somnath Banerjee, Sudip Kumar Naskar, Paolo Rosso, Sivaji Bandyopadhyay

Abstract: Question classification (QC) is a prime constituent of an automated question answering system. The work presented here demonstrates that the combination of multiple models achieves better classification performance than that obtained with existing individual models for the question classification task in Bengali. We have exploited state-of-the-art multiple model combination techniques, i.e., ensemble, stacking and voting, to increase QC accuracy. Lexical, syntactic and semantic features of Bengali questions are used for four well-known classifiers, namely Naïve Bayes, kernel Naïve Bayes, Rule Induction, and Decision Tree, which serve as our base learners. Single-layer question-class taxonomy with 8 coarse-grained classes is extended to two-layer taxonomy by adding 69 fine-grained classes. We carried out the experiments both on single-layer and two-layer taxonomies. Experimental results confirmed that classifier combination approaches outperform single classifier classification approaches by 4.02% for coarse-grained question classes. Overall, the stacking approach produces the best results for fine-grained classification and achieves an accuracy of 87.79%. The approach presented here could be used in other Indo-Aryan or Indic languages to develop a question answering system.

Submitted 6 September, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: 16 pages, to be published in Sadhana

Journal ref: Sadhana, Springer, 2019

arXiv:2008.13173 [pdf, other]

LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis

Authors: Somnath Banerjee, Sahar Ghannay, Sophie Rosset, Anne Vilnat, Paolo Rosso

Abstract: This paper describes the participation of LIMSI UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in SentiMix Hindi-English subtask, that addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose Recurrent Convolutional Neural Network that combines both the recurrent neural network and the convolutional network to better capture the semantics of the text, for code-mixed sentiment analysis. The proposed system obtained 0.69 (best run) in terms of F1 score on the given test data and achieved the 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask.

Submitted 30 August, 2020; originally announced August 2020.

Comments: To be published in the Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, Sep. Association for Computational Linguistics

arXiv:2008.01578 [pdf, other]

Automatic Dataset Builder for Machine Learning Applications to Satellite Imagery

Authors: Alessandro Sebastianelli, Maria Pia Del Rosso, Silvia Liberata Ullo

Abstract: Nowadays the use of Machine Learning (ML) algorithms is spreading in the field of Remote Sensing, with applications ranging from detection and classification of land use and monitoring to the prediction of many natural or anthropic phenomena of interest. One main limit of their employment is related to the need for a huge amount of data for training the neural network, chosen for the specific application, and the resulting computational weight and time required to collect the necessary data. In this letter the architecture of an innovative tool, enabling researchers to create in an automatic way suitable datasets for AI (Artificial Intelligence) applications in the EO (Earth Observation) context, is presented. Two versions of the architecture have been implemented and made available on GitHub, with a specific Graphical User Interface (GUI) for non-expert users.

Submitted 4 August, 2020; originally announced August 2020.

arXiv:2007.14936 [pdf, other]

doi 10.3233/JIFS-179895

#Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection

Authors: Mirko Lai, Viviana Patti, Giancarlo Ruffo, Paolo Rosso

Abstract: Interest has grown around the classification of stance that users assume within online debates in recent years. Stance has been usually addressed by considering users' posts in isolation, while social studies highlight that social communities may contribute to influence users' opinion. Furthermore, stance should be studied in a diachronic perspective, since it could help to shed light on users' opinion shift dynamics that can be recorded during the debate. We analyzed the political discussion in UK about the BREXIT referendum on Twitter, proposing a novel approach and annotation schema for stance detection, with the main aim of investigating the role of features related to social network community and diachronic stance evolution. Classification experiments show that such features provide very useful clues for detecting stance.

Submitted 29 July, 2020; originally announced July 2020.

Comments: To appear in Journal of Intelligent & Fuzzy Systems

arXiv:2004.09501 [pdf, other]

Application of DInSAR Technique to High Coherence Satellite Images for Strategic Infrastructure Monitoring

Authors: Tony De Corso, Luca Mignone, Alessandro Sebastianelli, Maria Pia Del Rosso, Claire Yost, Elena Ciampa, Marisa Pecce, Stefania Sica, Silvia Ullo

Abstract: In this paper the authors present and validate a procedure, which intends to combine the latest state of the art models in bridge monitoring with freely available satellite data. Through the Differential SAR interferometry (DinSAR) technique, a dataset of displacements for the Morandi bridge in Genoa (Italy), before its collapse, has been created, by using images downloaded by the Copernicus Open-Access Hub and the ASFVertex Hub. The data have been processed through the ESA SNAP software to identify the rate of displacements in the parts of the bridge where collapse occurred. Results demonstrate that the adopted procedure has great potentiality in the application field, as it represents a simple and inexpensive method to monitor large structures in a continuous way, by helping to better quantify risks and guide effective mitigation countermeasures. Moreover, the same procedure, once properly validated, could be effectively extended to the current and future performance estimation of civil infrastructures.

Submitted 19 April, 2020; originally announced April 2020.

Journal ref: IGARSS 2020 IEEE International Geoscience Remote Sensing Symposium

arXiv:2002.02427 [pdf, ps, other]

Irony Detection in a Multilingual Context

Authors: Bilal Ghanem, Jihen Karoui, Farah Benamara, Paolo Rosso, Véronique Moriceau

Abstract: This paper proposes the first multilingual (French, English and Arabic) and multicultural (Indo-European languages vs. less culturally close languages) irony detection system. We employ both feature-based models and neural architectures using monolingual word representation. We compare the performance of these systems with state-of-the-art systems to identify their capabilities. We show that these monolingual models trained separately on different languages using multilingual word representation or text-based features can open the door to irony detection in languages that lack annotated data for irony.

Submitted 6 February, 2020; originally announced February 2020.

arXiv:1910.14011 [pdf, other]

Stryker: Scaling Specification-Based Program Repair by Pruning Infeasible Mutants with SAT

Authors: Luciano Zemín, Simón Gutiérrez Brida, Santiago Bermúdez, Santiago Perez De Rosso, Nazareno Aguirre, Ali Mili, Ali Jaoua, Marcelo F. Frias

Abstract: Many techniques for automated program repair involve syntactic program transformations. Applying combinations of such transformations on faulty code yields fix candidates whose correctness must be determined. Exploring these combinations leads to an explosion on the number of generated fix candidates that severely limits the applicability of such fault repair techniques. This explosion is most times tamed by not considering fix candidates exhaustively, and by disabling intra-statement modifications. In this article we present a technique for program repair that considers an ample set of intra-statement syntactic operations, and explores fix candidates exhaustively up to a provided bound. The suitability of the technique, implemented in our tool Stryker, is supported by a novel mechanism to detect and prune infeasible fix candidates. This allows Stryker to repair programs with several bugs, whose fixes require multiple modifications. We evaluate our technique on a benchmark of faulty Java container classes, which Stryker is able to repair, pruning significant parts of the space of generated candidates when more than one bug is present in the code.

Submitted 30 October, 2019; originally announced October 2019.

MSC Class: 68Q60

arXiv:1910.06592 [pdf, other]

FacTweet: Profiling Fake News Twitter Accounts

Authors: Bilal Ghanem, Simone Paolo Ponzetto, Paolo Rosso

Abstract: We present an approach to detect fake news in Twitter at the account level using a neural recurrent model and a variety of different semantic and stylistic features. Our method extracts a set of features from the timelines of news Twitter accounts by reading their posts as chunks, rather than dealing with each tweet independently. We show the experimental benefits of modeling latent stylistic signatures of mixed fake and real news with a sequential model over a wide range of strong baselines.

Submitted 15 October, 2019; originally announced October 2019.

Comments: 6 pages

arXiv:1910.01340 [pdf, other]

TexTrolls: Identifying Russian Trolls on Twitter from a Textual Perspective

Authors: Bilal Ghanem, Davide Buscaldi, Paolo Rosso

Abstract: Newly emerging suspicious online users, usually called trolls, are one of the main sources of hateful, fake, and deceptive online messages. Some agendas use these harmful users to spread inciting tweets and, as a consequence, the audience is deceived. The challenge in detecting such accounts is that they conceal their identities, which disguises them on social media and makes them harder to identify using their social network information alone. Therefore, in this paper, we propose a text-based approach to detect online trolls such as those that were discovered during the US 2016 presidential elections. Our approach is mainly based on textual features which utilize thematic information, and profiling features to identify the accounts from their way of writing tweets. We deduced the thematic information in an unsupervised way and we show that coupling it with the textual features enhanced the performance of the proposed model. In addition, we find that the proposed profiling features perform best compared to the textual features.

Submitted 3 October, 2019; originally announced October 2019.

Comments: 15 pages

arXiv:1908.09951 [pdf, other]

An Emotional Analysis of False Information in Social Media and News Articles

Authors: Bilal Ghanem, Paolo Rosso, Francisco Rangel

Abstract: Fake news is risky since it has been created to manipulate the readers' opinions and beliefs. In this work, we compared the language of false news to that of real news from an emotional perspective, considering a set of false information types (propaganda, hoax, clickbait, and satire) from social media and online news article sources. Our experiments showed that false information has different emotional patterns in each of its types, and emotions play a key role in deceiving the reader. Based on that, we proposed an emotionally infused LSTM neural network model to detect false news.

Submitted 26 August, 2019; originally announced August 2019.

arXiv:1906.06151 [pdf, other]

Landslide Geohazard Assessment With Convolutional Neural Networks Using Sentinel-2 Imagery Data

Authors: Silvia L. Ullo, Maximillian S. Langenkamp, Tuomas P. Oikarinen, Maria P. Del Rosso, Alessandro Sebastianelli, Federica Piccirillo, Stefania Sica

Abstract: In this paper, the authors aim to combine the latest state-of-the-art models in image recognition with the best publicly available satellite images to create a system for landslide risk mitigation. We focus first on landslide detection and further propose a similar system to be used for prediction. Such models are valuable as they could easily be scaled up to provide data for hazard evaluation, as satellite imagery becomes increasingly available. The goal is to use satellite images and correlated data to enrich the public repository of data and guide disaster relief efforts for locating precise areas where landslides have occurred. Different image augmentation methods are used to increase diversity in the chosen dataset and create more robust classification. The resulting outputs are then fed into variants of 3-D convolutional neural networks. A review of the current literature indicates there is no research using CNNs (Convolutional Neural Networks) and freely available satellite imagery for classifying landslide risk. The model is shown to achieve significantly better-than-baseline accuracy.

Submitted 10 June, 2019; originally announced June 2019.

Comments: 4 pages, 3 figures, 1 table, accepted to 2019 IEEE IGARSS Conference that will be held in Japan next July

arXiv:1906.04836 [pdf, other]

Unmasking Bias in News

Authors: Javier Sánchez-Junquera, Paolo Rosso, Manuel Montes-y-Gómez, Simone Paolo Ponzetto

Abstract: We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand. Our results corroborate previous research on this task in that topic related features yield better results than stylistic ones. We additionally show that competitive results can be achieved by simply including higher-length n-grams, which suggests the need to develop more challenging datasets and tasks that address implicit and more subtle forms of bias.

Submitted 11 June, 2019; originally announced June 2019.

arXiv:1811.03091 [pdf]

doi 10.1016/j.optlastec.2019.04.005

Low-dispersion low-loss dielectric gratings for efficient ultrafast laser pulse compression at high average powers

Authors: David A. Alessi, Hoang T. Nguyen, Jerald A. Britten, Paul A. Rosso, Constantin Haefner

Abstract: We have developed low-dispersion (1480 l/mm), resonance-free, diffraction gratings made of dielectric materials resistant to femtosecond laser damage $(SiO_{2}/HfO_{2})$. A 14 cm diameter sample was fabricated resulting in a mean diffraction efficiency of 99.1% at λ = 810 nm with 0.4% uniformity using equipment which can fabricate gratings up to 1 m diagonal. The implementation of these gratings in the compression of 30 fs pulses in an out-of-plane geometry can result in compressor efficiencies of ~95%. The measured laser absorption is 500x lower than current ultrafast petawatt-class compressor gratings which will enable a substantial increase in average power handling capabilities of these laser systems.

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1807.11584 [pdf, ps, other]

UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering

Authors: Marc Franco-Salvador, Sudipta Kar, Thamar Solorio, Paolo Rosso

Abstract: In this work we describe the system built for the three English subtasks of the SemEval 2016 Task 3 by the Department of Computer Science of the University of Houston (UH) and the Pattern Recognition and Human Language Technology (PRHLT) research center - Universitat Politècnica de València: UH-PRHLT. Our system represents instances by using both lexical and semantic-based similarity measures between text pairs. Our semantic features include the use of distributed representations of words, knowledge graphs generated with the BabelNet multilingual semantic network, and the FrameNet lexical database. Experimental results outperform the random and Google search engine baselines in the three English subtasks. Our approach obtained the highest results of subtask B compared to the other task participants.

Submitted 30 July, 2018; originally announced July 2018.

Comments: Top system for question-question similarity in SemEval 2016 Task 3

arXiv:1805.11611 [pdf, other]

doi 10.3233/JIFS-169483

Semantically-informed distance and similarity measures for paraphrase plagiarism identification

Authors: Miguel A. Álvarez-Carmona, Marc Franco-Salvador, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Luis Villaseñor-Pineda

Abstract: Paraphrase plagiarism identification represents a very complex task given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures are able to extract semantic information from either an external resource or a distributed representation of words, resulting in informative features for training a supervised classifier for detecting paraphrase plagiarism. Obtained results indicate that the proposed metrics are consistently good in detecting different types of paraphrase plagiarism. In addition, results are very competitive against state-of-the-art methods, having the advantage of representing a much simpler but equally effective solution.

Submitted 29 May, 2018; originally announced May 2018.

Journal ref: Journal of Intelligent & Fuzzy Systems, vol. 34, no. 5, pp. 2983-2990, 2018

arXiv:1801.06436 [pdf, other]

A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Authors: Goran Glavaš, Marc Franco-Salvador, Simone Paolo Ponzetto, Paolo Rosso

Abstract: Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many languages (or language pairs) do not exist. In contrast, we propose an unsupervised and very resource-light approach for measuring semantic similarity between texts in different languages. To operate in the bilingual (or multilingual) space, we project continuous word vectors (i.e., word embeddings) from one language to the vector space of the other language via the linear translation model. We then align words according to the similarity of their vectors in the bilingual embedding space and investigate different unsupervised measures of semantic similarity exploiting bilingual embeddings and word alignments. Requiring only a limited-size set of word translation pairs between the languages, the proposed approach is applicable to virtually any pair of languages for which there exists a sufficiently large corpus, required to learn monolingual word embeddings. Experimental results on three different datasets for measuring semantic textual similarity show that our simple resource-light approach reaches performance close to that of supervised and resource-intensive methods, displaying stability across different language pairs. Furthermore, we evaluate the proposed method on two extrinsic tasks, namely extraction of parallel sentences from comparable corpora and cross-lingual plagiarism detection, and show that it yields performance comparable to that of complex resource-intensive state-of-the-art models for the respective tasks.

Submitted 19 January, 2018; originally announced January 2018.

Comments: Accepted for publication in Knowledge-Based Systems journal

arXiv:1705.10754 [pdf, other]

A Low Dimensionality Representation for Language Variety Identification

Authors: Francisco Rangel, Marc Franco-Salvador, Paolo Rosso

Abstract: Language variety identification aims at labelling texts in a native language (e.g. Spanish, Portuguese, English) with its specific variation (e.g. Argentina, Chile, Mexico, Peru, Spain; Brazil, Portugal; UK, US). In this work we propose a low dimensionality representation (LDR) to address this task with five different varieties of Spanish: Argentina, Chile, Mexico, Peru and Spain. We compare our LDR method with common state-of-the-art representations and show an increase in accuracy of ~35%. Furthermore, we compare LDR with two reference distributed representation models. Experimental results show competitive performance while dramatically reducing the dimensionality --and increasing the big data suitability-- to only 6 features per variety. Additionally, we analyse the behaviour of the employed machine learning algorithms and the most discriminating features. Finally, we employ an alternative dataset to test the robustness of our low dimensionality representation with another set of similar languages.

Submitted 30 May, 2017; originally announced May 2017.

Journal ref: CICLing - Computational Linguistics and Intelligent Text Processing, 2016

arXiv:1702.08021 [pdf, ps, other]

doi 10.1007/978-3-319-62434-1_13

Friends and Enemies of Clinton and Trump: Using Context for Detecting Stance in Political Tweets

Authors: Mirko Lai, Delia Irazú Hernández Farías, Viviana Patti, Paolo Rosso

Abstract: Stance detection, the task of identifying the speaker's opinion towards a particular target, has attracted the attention of researchers. This paper describes a novel approach for detecting stance in Twitter. We define a set of features in order to consider the context surrounding a target of interest with the final aim of training a model for predicting the stance towards the mentioned targets. In particular, we are interested in investigating political debates in social media. For this reason we evaluated our approach focusing on two targets of the SemEval-2016 Task 6 on Detecting stance in tweets, which are related to the political campaign for the 2016 U.S. presidential elections: Hillary Clinton vs. Donald Trump. For the sake of comparison with the state of the art, we evaluated our model against the dataset released in the SemEval-2016 Task 6 shared task competition. Our results outperform the best ones obtained by participating teams, and show that information about enemies and friends of politicians helps in detecting stance towards them.

Submitted 26 February, 2017; originally announced February 2017.

Comments: To appear in MICAI 2016 LNAI Proceedings

arXiv:1402.3070 [pdf, other]

Squeezing bottlenecks: exploring the limits of autoencoder semantic representation capabilities

Authors: Parth Gupta, Rafael E. Banchs, Paolo Rosso

Abstract: We present a comprehensive study on the use of autoencoders for modelling text data, in which (differently from previous studies) we focus our attention on the following issues: i) we explore the suitability of two different models bDA and rsDA for constructing deep autoencoders for text data at the sentence level; ii) we propose and evaluate two novel metrics for better assessing the text-reconstruction capabilities of autoencoders; and iii) we propose an automatic method to find the critical bottleneck dimensionality for text language representations (below which structural information is lost).

Submitted 13 February, 2014; originally announced February 2014.

Showing 1–48 of 48 results for author: Rosso, P