
The Geography of Information Diffusion in Online Discourse on Europe and Migration

Authors: Elisa Leonardelli, Sara Tonelli

Abstract: The online diffusion of information related to Europe and migration has been little investigated from an external point of view. However, this is a very relevant topic, especially if users have had no direct contact with Europe and its perception depends solely on information retrieved online. In this work we analyse the information circulating online about Europe and migration after retrieving a large amount of data from social media (Twitter), to gain new insights into topics, magnitude, and dynamics of their diffusion. We combine retweets and hashtags network analysis with geolocation of users, thus linking data to geography and allowing analysis from an "outside Europe" perspective, with a special focus on Africa. We also introduce a novel approach based on cross-lingual quotes, i.e. when content in a language is commented on and retweeted in another language, assuming these interactions are a proxy for connections between very distant communities. Results show how the majority of online discussions occurs at a national level, especially when discussing migration. Language (English) is pivotal for information to become transnational and reach far. Transnational information flow is strongly unbalanced, with content mainly produced in Europe and amplified outside. Conversely, Europe-based accounts tend to be self-referential when they discuss migration-related topics. Football is the most exported topic from Europe worldwide. Moreover, important nodes in the communities discussing migration-related topics include accounts of official institutions and international agencies, together with journalists, news, commentators and activists.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.02975 [pdf, other]

Putting Context in Context: the Impact of Discussion Structure on Text Classification

Authors: Nicolò Penzo, Antonio Longa, Bruno Lepri, Sara Tonelli, Marco Guerini

Abstract: Current text classification approaches usually focus on the content to be classified. Contextual aspects (both linguistic and extra-linguistic) are usually neglected, even in tasks based on online discussions. Still in many cases the multi-party and multi-turn nature of the context from which these elements are selected can be fruitfully exploited. In this work, we propose a series of experiments on a large dataset for stance detection in English, in which we evaluate the contribution of different types of contextual information, i.e. linguistic, structural and temporal, by feeding them as natural language input into a transformer-based model. We also experiment with different amounts of training data and analyse the topology of local discussion networks in a privacy-compliant way. Results show that structural information can be highly beneficial to text classification but only under certain circumstances (e.g. depending on the amount of training data and on discussion chain complexity). Indeed, we show that contextual information on smaller datasets from other classification tasks does not yield significant improvements. Our framework, based on local discussion networks, allows the integration of structural information, while minimising user profiling, thus preserving their privacy.

Submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted to EACL 2024 main conference

arXiv:2109.13563 [pdf, other]

doi 10.18653/v1/2021.emnlp-main.822

Agreeing to Disagree: Annotating Offensive Language Datasets with Annotators' Disagreement

Authors: Elisa Leonardelli, Stefano Menini, Alessio Palmero Aprosio, Marco Guerini, Sara Tonelli

Abstract: Since state-of-the-art approaches to offensive language detection rely on supervised learning, it is crucial to quickly adapt them to the continuously evolving scenario of social media. While several approaches have been proposed to tackle the problem from an algorithmic perspective, so as to reduce the need for annotated data, less attention has been paid to the quality of these data. Following a trend that has emerged recently, we focus on the level of agreement among annotators while selecting data to create offensive language datasets, a task involving a high level of subjectivity. Our study comprises the creation of three novel datasets of English tweets covering different topics and having five crowd-sourced judgments each. We also present an extensive set of experiments showing that selecting training and test data according to different levels of annotators' agreement has a strong effect on classifiers' performance and robustness. Our findings are further validated in cross-domain experiments and studied using a popular benchmark dataset. We show that such hard cases, where low agreement is present, are not necessarily due to poor-quality annotation and we advocate for a higher presence of ambiguous cases in future datasets, particularly in test sets, to better account for the different points of view expressed online.

Submitted 28 September, 2021; originally announced September 2021.

Comments: To appear at EMNLP 2021 (long paper)

arXiv:2109.12053 [pdf, other]

doi 10.18653/v1/2021.findings-emnlp.250

Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus

Authors: Daniela Trotta, Raffaele Guarasci, Elisa Leonardelli, Sara Tonelli

Abstract: The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the lack of resources with a comparable size in other languages. We have therefore developed the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments, which has been created following the same approach and the same steps as the English one. In this paper we describe the corpus creation, we detail its content, and we present the first experiments on this new resource. We compare in-domain and out-of-domain classification, and perform a specific evaluation of nine linguistic phenomena. We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.

Submitted 24 September, 2021; originally announced September 2021.

Comments: Findings of EMNLP 2021. Dataset available at https://github.com/dhfbk/ItaCoLA-dataset

arXiv:2107.02472 [pdf, other]

doi 10.1016/j.osnem.2021.100150

Empowering NGOs in Countering Online Hate Messages

Authors: Yi-Ling Chung, Serra Sinem Tekiroglu, Sara Tonelli, Marco Guerini

Abstract: Studies on online hate speech have mostly focused on the automated detection of harmful messages. Little attention has been devoted so far to the development of effective strategies to fight hate speech, in particular through the creation of counter-messages. While existing manual scrutiny and intervention strategies are time-consuming and not scalable, advances in natural language processing have the potential to provide a systematic approach to hatred management. In this paper, we introduce a novel ICT platform that NGO operators can use to monitor and analyze social media data, along with a counter-narrative suggestion tool. Our platform aims at increasing the efficiency and effectiveness of operators' activities against islamophobia. We test the platform with more than one hundred NGO operators in three countries through qualitative and quantitative evaluation. Results show that NGOs favor the platform solution with the suggestion tool, and that the time required to produce counter-narratives significantly decreases.

Submitted 6 July, 2021; originally announced July 2021.

Comments: Preprint of the paper published in Online Social Networks and Media Journal (OSNEM)

arXiv:2103.14916 [pdf, other]

Abuse is Contextual, What about NLP? The Role of Context in Abusive Language Annotation and Detection

Authors: Stefano Menini, Alessio Palmero Aprosio, Sara Tonelli

Abstract: The datasets most widely used for abusive language detection contain lists of messages, usually tweets, that have been manually judged as abusive or not by one or more annotators, with the annotation performed at message level. In this paper, we investigate what happens when the hateful content of a message is judged also based on the context, given that messages are often ambiguous and need to be interpreted in the context of occurrence. We first re-annotate part of a widely used dataset for abusive language detection in English in two conditions, i.e. with and without context. Then, we compare the performance of three classification algorithms obtained on these two types of dataset, arguing that a context-aware classification is more challenging but also more similar to a real application scenario.

Submitted 27 March, 2021; originally announced March 2021.

arXiv:2005.02235 [pdf, other]

Creating a Multimodal Dataset of Images and Text to Study Abusive Language

Authors: Alessio Palmero Aprosio, Stefano Menini, Sara Tonelli

Abstract: In order to study online hate speech, the availability of datasets containing the linguistic phenomena of interest is of crucial importance. However, when it comes to specific target groups, for example teenagers, collecting such data may be problematic due to issues with consent and privacy restrictions. Furthermore, while text-only datasets of this kind have been widely used, limitations set by image-based social media platforms like Instagram make it difficult for researchers to experiment with multimodal hate speech data. We therefore developed CREENDER, an annotation tool that has been used in school classes to create a multimodal dataset of images and abusive comments, which we make freely available under Apache 2.0 license. The corpus, with Italian comments, has been analysed from different perspectives, to investigate whether the subject of the images plays a role in triggering a comment. We find that users judge the same images in different ways, although the presence of a person in the picture increases the probability of getting an offensive comment.

Submitted 5 May, 2020; originally announced May 2020.

arXiv:1912.07551 [pdf, other]

doi 10.1140/epjds/s13688-019-0215-7

Following the footsteps of giants: Modeling the mobility of historically notable individuals using Wikipedia

Authors: Lorenzo Lucchini, Sara Tonelli, Bruno Lepri

Abstract: The steady growth of digitized historical information is continuously stimulating new different approaches to the fields of Digital Humanities and Computational Social Science. In this work, we use Natural Language Processing techniques to retrieve large amounts of historical information from Wikipedia. In particular, the pages of a set of historically notable individuals are processed to catch the locations and the date of people's movements. This information is then structured in a geographical network of mobility patterns. We analyze the mobility of historically notable individuals from different perspectives to better understand the role of migrations and international collaborations in the context of innovation and cultural development. In this work, we first present some general characteristics of the dataset from a social and geographical perspective. Then, we build a spatial network of cities, and we model and quantify the tendency to explore by a set of people that can be considered historically and culturally notable. In this framework, we show that by using a multilevel radiation model for human mobility, we are able to catch important features of migration's behavior. Results show that the choice of the target migration place for historically and culturally relevant people is limited to a small number of locations and that it depends on the discipline a notable is interested in and on the number of opportunities she/he can find there.

Submitted 16 December, 2019; originally announced December 2019.

Journal ref: EPJ Data Sci. 8, 36 (2019)

Showing 1–8 of 8 results for author: Tonelli, S