Search | arXiv e-print repository

Online disinformation in the 2020 U.S. Election: swing vs. safe states

Authors: Manuel Pratelli, Marinella Petrocchi, Fabio Saracco, Rocco De Nicola

Abstract: For U.S. presidential elections, most states use the so-called winner-take-all system, in which the state's presidential electors are awarded to the winning political party in the state after a popular vote phase, regardless of the actual margin of victory. Therefore, election campaigns are especially intense in states where there is no clear direction on which party will be the winning party. These states are often referred to as swing states. To measure the impact of such an election law on the campaigns, we analyze the Twitter activity surrounding the 2020 US preelection debate, with a particular focus on the spread of disinformation. We find that about 88% of the online traffic was associated with swing states. In addition, the sharing of links to unreliable news sources is significantly more prevalent in tweets associated with swing states: in this case, untrustworthy tweets are predominantly generated by automated accounts. Furthermore, we observe that the debate is mostly led by two main communities, one with a predominantly Republican affiliation and the other with accounts of different political orientations. Most of the disinformation comes from the former.

Submitted 12 March, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

Comments: arXiv admin note: text overlap with arXiv:2303.12474

arXiv:2402.18621 [pdf, other]

Unveiling News Publishers Trustworthiness Through Social Interactions

Authors: Manuel Pratelli, Fabio Saracco, Marinella Petrocchi

Abstract: With the primary goal of raising readers' awareness of misinformation phenomena, extensive efforts have been made by both academic institutions and independent organizations to develop methodologies for assessing the trustworthiness of online news publishers. Unfortunately, existing approaches are costly and face critical scalability challenges. This study presents a novel framework for assessing the trustworthiness of online news publishers using user interactions on social media platforms. The proposed methodology provides a versatile solution that serves the dual purpose of i) identifying verifiable online publishers and ii) automatically performing an initial estimation of the trustworthiness of previously unclassified online news outlets.

Submitted 28 February, 2024; originally announced February 2024.

Comments: A pre-final version of the paper accepted at WebSci'24

arXiv:2401.01781 [pdf, other]

Evaluating Trustworthiness of Online News Publishers via Article Classification

Authors: John Bianchi, Manuel Pratelli, Marinella Petrocchi, Fabio Pinelli

Abstract: The proliferation of low-quality online information in today's era has underscored the need for robust and automatic mechanisms to evaluate the trustworthiness of online news publishers. In this paper, we analyse the trustworthiness of online news media outlets by leveraging a dataset of 4033 news stories from 40 different sources. We aim to infer the trustworthiness level of the source based on the classification of individual articles' content. The trust labels are obtained from NewsGuard, a journalistic organization that evaluates news sources using well-established editorial and publishing criteria. The results indicate that the classification model is highly effective in classifying the trustworthiness levels of the news articles. This research has practical applications in alerting readers to potentially untrustworthy news sources, assisting journalistic organizations in evaluating new or unfamiliar media outlets and supporting the selection of articles for their trustworthiness assessment.

Submitted 3 January, 2024; originally announced January 2024.

Comments: This paper will appear in the proceedings of the 2024 ACM/SIGAPP Symposium on Applied Computing, Avila, Spain, April 8-12, 2024. The version here submitted is the accepted version before publisher typesetting

arXiv:2308.01750 [pdf, other]

doi 10.1093/pnasnexus/pgae177

Entropy-based detection of Twitter echo chambers

Authors: Manuel Pratelli, Fabio Saracco, Marinella Petrocchi

Abstract: Echo chambers, i.e. clusters of users exposed to news and opinions in line with their previous beliefs, were observed in many online debates on social platforms. We propose a completely unbiased entropy-based method for detecting echo chambers. The method is completely agnostic to the nature of the data. In the Italian Twitter debate about the Covid-19 vaccination, we find a limited presence of users in echo chambers (about 0.35% of all users). Nevertheless, their impact on the formation of a common discourse is strong, as users in echo chambers are responsible for nearly a third of the retweets in the original dataset. Moreover, in the case study observed, echo chambers appear to be a receptacle for disinformative content.

Submitted 28 February, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

Comments: 30 pages, 11 figures, 7 tables

Journal ref: PNAS Nexus, Volume 3, Issue 5, May 2024, pgae177

arXiv:2304.07535 [pdf, other]

From Online Behaviours to Images: A Novel Approach to Social Bot Detection

Authors: Edoardo Di Paolo, Marinella Petrocchi, Angelo Spognardi

Abstract: Online Social Networks have revolutionized how we consume and share information, but they have also led to a proliferation of content not always reliable and accurate. One particular type of social accounts is known to promote unreputable content, hyperpartisan, and propagandistic information. They are automated accounts, commonly called bots. Focusing on Twitter accounts, we propose a novel approach to bot detection: we first propose a new algorithm that transforms the sequence of actions that an account performs into an image; then, we leverage the strength of Convolutional Neural Networks to proceed with image classification. We compare our performances with state-of-the-art results for bot detection on genuine accounts / bot accounts datasets well known in the literature. The results confirm the effectiveness of the proposal, because the detection capability is on par with the state of the art, if not better in some cases.

Submitted 15 April, 2023; originally announced April 2023.

Comments: Accepted @ICCS2023, 23rd International Conference on Computational Science, 3-5 July, 2023. The present version is a preprint

arXiv:2303.17251 [pdf, other]

Demystifying Misconceptions in Social Bots Research

Authors: Stefano Cresci, Kai-Cheng Yang, Angelo Spognardi, Roberto Di Pietro, Filippo Menczer, Marinella Petrocchi

Abstract: Research on social bots aims at advancing knowledge and providing solutions to one of the most debated forms of online manipulation. Yet, social bot research is plagued by widespread biases, hyped results, and misconceptions that set the stage for ambiguities, unrealistic expectations, and seemingly irreconcilable findings. Overcoming such issues is instrumental towards ensuring reliable solutions and reaffirming the validity of the scientific method. In this contribution, we review some recent results in social bots research, highlighting and revising factual errors as well as methodological and conceptual biases. More importantly, we demystify common misconceptions, addressing fundamental points on how social bots research is discussed. Our analysis surfaces the need to discuss research about online disinformation and manipulation in a rigorous, unbiased, and responsible way. This article bolsters such effort by identifying and refuting common fallacious arguments used by both proponents and opponents of social bots research, as well as providing directions toward sound methodologies for future research in the field.

Submitted 27 March, 2024; v1 submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.12474 [pdf, other]

Swinging in the States: Does disinformation on Twitter mirror the US presidential election system?

Authors: Manuel Pratelli, Marinella Petrocchi, Fabio Saracco, Rocco De Nicola

Abstract: For more than a decade scholars have been investigating the disinformation flow on social media contextually to societal events, like, e.g., elections. In this paper, we analyze the Twitter traffic related to the US 2020 pre-election debate and ask whether it mirrors the electoral system. The U.S. electoral system provides that, regardless of the actual vote gap, the premier candidate who received more votes in one state 'takes' that state. Criticisms of this system have pointed out that election campaigns can be more intense in particular key states to achieve victory, so-called swing states. Our intuition is that election debate may cause more traffic on Twitter (and probably be more plagued by misinformation) when associated with swing states. The results mostly confirm the intuition. About 88% of the entire traffic can be associated with swing states, and links to non-trustworthy news are shared far more in swing-related traffic than the same type of news in safe-related traffic. Considering traffic origin instead, non-trustworthy tweets generated by automated accounts, so-called social bots, are mostly associated with swing states. Our work sheds light on the role an electoral system plays in the evolution of online debates, with, in the spotlight, disinformation and social bots.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 9 pages, 2 figures; Accepted @CySoc 2023, International Workshop on Cyber Social Threats, co-located with the ACM Web conference 2023, April 30, 2023. The present version is a preprint

arXiv:2205.02736 [pdf, other]

A Structured Analysis of Journalistic Evaluations for News Source Reliability

Authors: Manuel Pratelli, Marinella Petrocchi

Abstract: In today's era of information disorder, many organizations are moving to verify the veracity of news published on the web and social media. In particular, some agencies are exploring the world of online media and, through a largely manual process, ranking the credibility and transparency of news sources around the world. In this paper, we evaluate two procedures for assessing the risk of online media exposing their readers to m/disinformation. The procedures have been dictated by NewsGuard and The Global Disinformation Index, two well-known organizations combating d/misinformation via practices of good journalism. Specifically, considering a fixed set of media outlets, we examine how many of them were rated equally by the two procedures, and which aspects led to disagreement in the assessment. The result of our analysis shows a good degree of agreement, which in our opinion has a double value: it fortifies the correctness of the procedures and lays the groundwork for their automation.

Submitted 5 May, 2022; originally announced May 2022.

Comments: Accepted at MEDIATE 2022. 'Misinformation: new directions in automation, real-world applications, and interventions', a workshop @ICWSM 2022
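
The comparison described in the abstract above (how many outlets two procedures rate equally, and which ones they disagree on) can be illustrated with a minimal sketch. The outlet names, the labels "low-risk" and "high-risk", and the function `compare_ratings` are hypothetical; the sketch also assumes both procedures have already been mapped to a common label scale, which the abstract does not describe.

```python
def compare_ratings(ratings_a, ratings_b):
    """Compare two outlet -> label mappings on the outlets they share.

    Returns the fraction of shared outlets with identical labels and the
    sorted list of outlets on which the two procedures disagree.
    """
    shared = sorted(set(ratings_a) & set(ratings_b))
    if not shared:
        raise ValueError("the two rating sets share no outlets")
    disagreements = [o for o in shared if ratings_a[o] != ratings_b[o]]
    agreement_rate = 1 - len(disagreements) / len(shared)
    return agreement_rate, disagreements


# Hypothetical ratings on a common two-level scale (an assumption).
procedure_a = {"outlet-1": "low-risk", "outlet-2": "high-risk",
               "outlet-3": "low-risk", "outlet-4": "high-risk"}
procedure_b = {"outlet-1": "low-risk", "outlet-2": "low-risk",
               "outlet-3": "low-risk", "outlet-4": "high-risk"}

rate, disputed = compare_ratings(procedure_a, procedure_b)
print(f"agreement: {rate:.0%}, disputed outlets: {disputed}")
```

The list of disputed outlets is the natural starting point for asking which rating criteria caused the disagreement.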

arXiv:2202.03316 [pdf, other]

doi 10.1038/s41598-022-16603-7

Bow-Tie Structures of Twitter Discursive Communities

Authors: Mattia Mattei, Manuel Pratelli, Guido Caldarelli, Marinella Petrocchi, Fabio Saracco

Abstract: In the analysis of Twitter debate, the recent literature focused on discursive communities, i.e. clusters of accounts interacting among themselves via retweets. In the present work, we studied discursive communities in 8 different thematic Twitter datasets in various languages. Surprisingly, we observed that almost all discursive communities therein display a bow-tie structure during political or societal debates. Instead, they are absent when the topic of the discussion is different, such as sports events, as in the case of the Euro2020 Turkish and Italian datasets. We furthermore analysed the quality of the content created in the various sectors of the different discursive communities, using the domain annotation from the fact-checking website NewsGuard: we observe that, when the discursive community is affected by m/disinformation, the content with the lowest quality is that produced and shared in the SCC and, in particular, a strong incidence of low- or non-reputable messages is present in the flow of retweets between the SCC and the OUT sectors. In this sense, in discursive communities affected by m/disinformation, the greatest part of the accounts has access to a great variety of content, whose quality is, in general, quite low; such a situation perfectly describes the phenomenon of infodemic, i.e. the access to "an excessive amount of information about a problem, which makes it difficult to identify a solution", according to WHO.

Submitted 28 June, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: 47 pages, 25 figures, 7 tables

Journal ref: Sci Rep 12, 12944 (2022)
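
For readers unfamiliar with the bow-tie terminology used in the abstract above (SCC, IN, OUT), the following is a minimal sketch of a standard bow-tie decomposition of a directed graph. It is not the authors' pipeline: the edge list is invented, the edge direction (source to target) is an assumption, and the simple reachability-based search for the largest strongly connected component is only suitable for small graphs.

```python
from collections import defaultdict, deque


def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from any start node."""
    seen = set(start_nodes)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(edges):
    """Split a directed graph into SCC, IN, OUT and OTHER sectors."""
    succ, pred = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    # Largest strongly connected component: for each unassigned node,
    # its component is (nodes it reaches) & (nodes that reach it).
    largest, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reachable([node], succ) & reachable([node], pred)
        assigned |= component
        if len(component) > len(largest):
            largest = component

    forward = reachable(largest, succ)    # SCC plus everything downstream
    backward = reachable(largest, pred)   # SCC plus everything upstream
    return {
        "SCC": largest,
        "IN": backward - largest,
        "OUT": forward - largest,
        "OTHER": nodes - forward - backward,  # tendrils, disconnected parts
    }


# Hypothetical toy graph: a -> b -> c -> a forms the core.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"),
             ("x", "a"), ("c", "y"), ("z", "y")]
sectors = bow_tie(toy_edges)
print(sectors)
```

In the toy graph, `x` feeds into the core (IN), `y` only receives from it (OUT), and `z` touches neither direction of the core, so it falls into OTHER.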

arXiv:2111.12034 [pdf, other]

Adversarial machine learning for protecting against online manipulation

Authors: Stefano Cresci, Marinella Petrocchi, Angelo Spognardi, Stefano Tognazzi

Abstract: Adversarial examples are inputs to a machine learning system that result in an incorrect output from that system. Attacks launched through this type of input can cause severe consequences: for example, in the field of image recognition, a stop signal can be misclassified as a speed limit indication. However, adversarial examples also represent the fuel for a flurry of research directions in different domains and applications. Here, we give an overview of how they can be profitably exploited as powerful tools to build stronger learning models, capable of better-withstanding attacks, for two crucial tasks: fake news and social bot detection.

Submitted 23 November, 2021; originally announced November 2021.

Comments: To appear in IEEE Internet Computing. 'Accepted manuscript' version

arXiv:2101.10782 [pdf, other]

A Behavioural Analysis of Credulous Twitter Users

Authors: Alessandro Balestrucci, Rocco De Nicola, Marinella Petrocchi, Catia Trubiani

Abstract: Thanks to platforms such as Twitter and Facebook, people can know facts and events that otherwise would have been silenced. However, social media also contribute significantly to the fast spreading of biased and false news while targeting specific segments of the population. We have seen how false information can be spread using automated accounts, known as bots. Using Twitter as a benchmark, we investigate behavioural attitudes of so-called 'credulous' users, i.e., genuine accounts following many bots. Leveraging our previous work, where supervised learning is successfully applied to single out credulous users, we improve the classification task with a detailed feature analysis and provide evidence that simple and lightweight features are crucial to detect such users. Furthermore, we study the differences in the way credulous and non-credulous users interact with bots and discover that credulous users tend to amplify the content posted by bots more. We argue that their detection can be instrumental in obtaining useful information on the possible dissemination of spam content, propaganda, and, in general, information of little or no reliability.

Submitted 26 January, 2021; originally announced January 2021.

Comments: Under submission

arXiv:2012.13905 [pdf, other]

Improving Opinion Spam Detection by Cumulative Relative Frequency Distribution

Authors: Michela Fazzolari, Francesco Buccafurri, Gianluca Lax, Marinella Petrocchi

Abstract: Over the last years, online reviews became very important since they can influence the purchase decision of consumers and the reputation of businesses; therefore, the practice of writing fake reviews can have severe consequences on customers and service providers. Various approaches have been proposed for detecting opinion spam in online reviews, especially based on supervised classifiers. In this contribution, we start from a set of effective features used for classifying opinion spam and we re-engineer them, by considering the Cumulative Relative Frequency Distribution of each feature. By an experimental evaluation carried out on real data from Yelp.com, we show that the use of the distributional features is able to improve the performances of classifiers.

Submitted 27 December, 2020; originally announced December 2020.

Comments: Manuscript accepted for publication in ACM Journal of Data and Information Quality. This is the pre-final version, before proofs checking
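
As a rough illustration of the Cumulative Relative Frequency Distribution mentioned in the abstract above, the sketch below replaces each raw feature value with the fraction of observations that are less than or equal to it. This is the textbook definition of a cumulative relative frequency, not necessarily the exact re-engineering used in the paper; the feature (review length in words) and the sample values are hypothetical.

```python
from bisect import bisect_right


def cumulative_relative_frequency(values):
    """Return F, where F(x) is the fraction of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")

    def crf(x):
        # bisect_right counts how many sorted observations are <= x.
        return bisect_right(ordered, x) / n

    return crf


# Hypothetical raw feature: review length in words for five reviews.
review_lengths = [12, 40, 40, 75, 130]
crf = cumulative_relative_frequency(review_lengths)

# Distributional version of the feature: each value becomes its position
# in the empirical distribution, always between 0 and 1.
distributional_feature = [crf(v) for v in review_lengths]
print(distributional_feature)
```

One appeal of such a transformation is that the resulting values are bounded and comparable across features with very different raw scales.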

arXiv:2010.01913 [pdf, other]

doi 10.1140/epjds/s13688-021-00289-4

Flow of online misinformation during the peak of the COVID-19 pandemic in Italy

Authors: Guido Caldarelli, Rocco de Nicola, Marinella Petrocchi, Manuel Pratelli, Fabio Saracco

Abstract: The COVID-19 pandemic has impacted every human activity and, because of the urgency of finding the proper responses to such an unprecedented emergency, it generated a diffused societal debate. The online version of this discussion was not exempt from the presence of d/misinformation campaigns, but differently from what was already witnessed in other debates, the flow of false information about COVID-19, intentional or not, put public health at severe risk, reducing the effectiveness of governments' countermeasures. In the present manuscript, we study the effective impact of misinformation in the Italian societal debate on Twitter during the pandemic, focusing on the various discursive communities. In order to extract the discursive communities, we focus on verified users, i.e. accounts whose identity is officially certified by Twitter. We thus infer the various discursive communities based on how verified users are perceived by standard ones: if two verified accounts are considered as similar by unverified ones, we link them in the network of certified accounts. We first observe that, besides being a mostly scientific subject, the COVID-19 discussion shows a clear division into what turn out to be different political groups. At this point, by using a commonly available fact-checking software (NewsGuard), we assess the reputation of the pieces of news exchanged. We filter the network of retweets (i.e. users re-broadcasting the same elementary piece of information, or tweet) from random noise and check the presence of messages displaying a URL.
The impact of misinformation posts reaches 22.1% in the right and center-right wing community and its contribution is even stronger in absolute numbers, due to the activity of this group: 96% of all non-reputable URLs shared by political groups come from this community.

Submitted 23 February, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: 25 pages, 4 figures. The Abstract, the Introduction, the Results, the Conclusions and the Methods were substantially rewritten. The plot of the network has been changed, as well as the tables

Journal ref: EPJ Data Sci. 10, 34 (2021)

arXiv:1909.03851 [pdf, ps, other]

Do you really follow them? Automatic detection of credulous Twitter users

Authors: Alessandro Balestrucci, Rocco De Nicola, Marinella Petrocchi, Catia Trubiani

Abstract: Online Social Media represent a pervasive source of information able to reach a huge audience. Sadly, recent studies show how online social bots (automated, often malicious accounts, populating social networks and mimicking genuine users) are able to amplify the dissemination of (fake) information by orders of magnitude. Using Twitter as a benchmark, in this work we focus on what we define credulous users, i.e., human-operated accounts with a high percentage of bots among their followings. Being more exposed to the harmful activities of social bots, credulous users may run the risk of being more influenced than other users; even worse, although unknowingly, they could become spreaders of misleading information (e.g., by retweeting bots). We design and develop a supervised classifier to automatically recognize credulous users. The best tested configuration achieves an accuracy of 93.27% and AUC-ROC of 0.93, thus leading to positive and encouraging results.

Submitted 9 September, 2019; originally announced September 2019.

Comments: 8 pages, 2 tables. Accepted for publication at IDEAL 2019 (20th International Conference on Intelligent Data Engineering and Automated Learning, Manchester, UK, 14-16 November, 2019). The present version is the accepted version, and it is not the final published version
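
The definition of credulous users given in the abstract above (human-operated accounts with a high percentage of bots among their followings) can be made concrete with a small sketch. This only illustrates the definition and is not the supervised classifier the paper develops; the 0.5 threshold, the account names, and the `is_bot` lookup are all assumptions made for the example.

```python
def bot_ratio(followings, is_bot):
    """Fraction of the accounts a user follows that are flagged as bots."""
    if not followings:
        return 0.0
    return sum(1 for account in followings if is_bot(account)) / len(followings)


def is_credulous(followings, is_bot, threshold=0.5):
    """Flag a human account whose bot ratio meets a (hypothetical) threshold."""
    return bot_ratio(followings, is_bot) >= threshold


# Hypothetical bot labels, e.g. produced by some external bot detector.
known_bots = {"bot_1", "bot_2", "bot_3"}

user_a = ["bot_1", "bot_2", "human_1", "bot_3"]   # 3 of 4 followings are bots
user_b = ["human_1", "human_2", "bot_1"]          # 1 of 3 followings is a bot

flag_a = is_credulous(user_a, known_bots.__contains__)
flag_b = is_credulous(user_b, known_bots.__contains__)
print(flag_a, flag_b)
```

Computing this ratio directly requires bot labels for every followed account, which is costly at scale; a classifier trained on cheaper account features, as the abstract describes, avoids that step.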

arXiv:1905.12687 [pdf, other]

doi 10.1038/s42005-020-0340-4

The role of bot squads in the political propaganda on Twitter

Authors: Guido Caldarelli, Rocco De Nicola, Fabio Del Vigna, Marinella Petrocchi, Fabio Saracco

Abstract: Social Media are nowadays the privileged channel for information spreading and news checking. Unexpectedly for most of the users, automated accounts, also known as social bots, contribute more and more to this process of news spreading. Using Twitter as a benchmark, we consider the traffic exchanged, over one month of observation, on a specific topic, namely the migration flux from Northern Africa to Italy. We measure the significant traffic of tweets only, by implementing an entropy-based null model that discounts the activity of users and the virality of tweets. Results show that social bots play a central role in the exchange of significant content. Indeed, not only do the strongest hubs have a higher than expected number of bots among their followers, but a group of them, which can be assigned to the same political tendency, also shares a common set of bots as followers. The retweeting activity of such automated accounts amplifies the presence on the platform of the hubs' messages.

Submitted 29 May, 2019; originally announced May 2019.

Comments: Under Submission

Journal ref: Commun Phys 3, 81 (2020)

arXiv:1904.05132 [pdf, other]

Better Safe Than Sorry: An Adversarial Approach to Improve Social Bot Detection

Authors: Stefano Cresci, Marinella Petrocchi, Angelo Spognardi, Stefano Tognazzi

Abstract: The arms race between spambots and spambot-detectors is made of several cycles (or generations): a new wave of spambots is created (and new spam is spread), new spambot filters are derived and old spambots mutate (or evolve) to new species. Recently, with the diffusion of the adversarial learning approach, a new practice is emerging: to manipulate on purpose target samples in order to make stronger detection models. Here, we manipulate generations of Twitter social bots, to obtain - and study - their possible future evolutions, with the aim of eventually deriving more effective detection techniques. In detail, we propose and experiment with a novel genetic algorithm for the synthesis of online accounts. The algorithm allows the creation of synthetic evolved versions of current state-of-the-art social bots. Results demonstrate that synthetic bots really escape current detection techniques. However, they give all the needed elements to improve such techniques, making possible a proactive approach for the design of social bot detection systems.

Submitted 10 April, 2019; originally announced April 2019.

Comments: This is the pre-final version of a paper accepted @ 11th ACM Conference on Web Science, June 30-July 3, 2019, Boston, US

arXiv:1804.03433 [pdf, other]

Who framed Roger Reindeer? De-censorship of Facebook posts by snippet classification

Authors: Fabio Del Vigna, Marinella Petrocchi, Alessandro Tommasi, Cesare Zavattari, Maurizio Tesconi

Abstract: This paper considers online news censorship and it concentrates on censorship of identities. Obfuscating identities may occur for disparate reasons, from military to judiciary ones. In the majority of cases, this happens to protect individuals from being identified and persecuted by hostile people. However, since the collaborative web is characterised by a redundancy of information, it is not unusual that the same fact is reported by multiple sources, which may not apply the same restriction policies in terms of censorship. Also, the proven aptitude of social network users to disclose personal information leads to the phenomenon that comments on news can reveal the data withheld in the news itself. This gives us a means to figure out who the subject of the censored news is. We propose an adaptation of a text analysis approach to unveil censored identities. The approach is tested on a synthesised scenario, which however resembles a real use case. Leveraging a text analysis based on a context classifier trained over snippets from posts and comments of Facebook pages, we achieve promising results. Despite the quite constrained settings in which we operate (such as considering only snippets of very short length), our system successfully detects the censored name, choosing among 10 different candidate names, in more than 50% of the investigated cases. This outperforms the results of two reference baselines.
The findings reported in this paper, other than being supported by a thorough experimental methodology and interesting on their own, also pave the way for further investigation on the insidious issues of censorship on the web.

Submitted 10 April, 2018; originally announced April 2018.

Comments: Accepted for publication: Elsevier Online Social Networks and Media

arXiv:1707.06932 [pdf, other]

doi 10.1007/s12559-017-9496-y

A study on text-score disagreement in online reviews

Authors: Michela Fazzolari, Vittoria Cozza, Marinella Petrocchi, Angelo Spognardi

Abstract: In this paper, we focus on online reviews and employ artificial intelligence tools, taken from the cognitive computing field, to help understand the relationships between the textual part of the review and the assigned numerical score. We move from the intuitions that 1) a set of textual reviews expressing different sentiments may feature the same score (and vice-versa); and 2) detecting and analyzing the mismatches between the review content and the actual score may benefit both service providers and consumers, by highlighting specific factors of satisfaction (and dissatisfaction) in texts. To prove the intuitions, we adopt sentiment analysis techniques and we concentrate on hotel reviews, to find polarity mismatches therein. In particular, we first train a text classifier with a set of annotated hotel reviews, taken from the Booking website. Then, we analyze a large dataset, with around 160k hotel reviews collected from Tripadvisor, with the aim of detecting a polarity mismatch, indicating if the textual content of the review is in line, or not, with the associated score. Using well established artificial intelligence techniques and analyzing in depth the reviews featuring a mismatch between the text polarity and the score, we find that, on a scale of five stars, those reviews ranked with middle scores include a mixture of positive and negative aspects.
The approach proposed here, besides acting as a polarity detector, provides an effective selection of reviews from an initial very large dataset that may allow both consumers and providers to focus directly on the review subset featuring a text/score disagreement, which conveniently conveys to the user a summary of positive and negative features of the review target.

Submitted 21 July, 2017; originally announced July 2017.

Comments: This is the accepted version of the paper. The final version will be published in the Journal of Cognitive Computation, available at Springer via http://dx.doi.org/10.1007/s12559-017-9496-y

arXiv:1704.05393 [pdf, other]

Mining Worse and Better Opinions. Unsupervised and Agnostic Aggregation of Online Reviews

Authors: Michela Fazzolari, Marinella Petrocchi, Alessandro Tommasi, Cesare Zavattari

Abstract: In this paper, we propose a novel approach for aggregating online reviews, according to the opinions they express. Our methodology is unsupervised - due to the fact that it does not rely on pre-labeled reviews - and it is agnostic - since it does not make any assumption about the domain or the language of the review content. We measure the adherence of a review content to the domain terminology extracted from a review set. First, we demonstrate the informativeness of the adherence metric with respect to the score associated with a review. Then, we exploit the metric values to group reviews, according to the opinions they express. Our experimental campaign has been carried out on two large datasets collected from Booking and Amazon, respectively.

Submitted 18 April, 2017; originally announced April 2017.

arXiv:1703.04482 [pdf, other]

doi 10.1109/TDSC.2017.2681672

Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling

Authors: Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi

Abstract: Spambot detection in online social networks is a long-lasting challenge involving the study and design of detection techniques capable of efficiently identifying ever-evolving spammers. Recently, a new wave of social spambots has emerged, with advanced human-like characteristics that allow them to go undetected even by current state-of-the-art algorithms. In this paper, we show that efficient spambots detection can be achieved via an in-depth analysis of their collective behaviors exploiting the digital DNA technique for modeling the behaviors of social network users. Inspired by its biological counterpart, in the digital DNA representation the behavioral lifetime of a digital account is encoded in a sequence of characters. Then, we define a similarity measure for such digital DNA sequences. We build upon digital DNA and the similarity between groups of users to characterize both genuine accounts and spambots. Leveraging such characterization, we design the Social Fingerprinting technique, which is able to discriminate among spambots and genuine accounts in both a supervised and an unsupervised fashion. We finally evaluate the effectiveness of Social Fingerprinting and we compare it with three state-of-the-art detection algorithms. Among the peculiarities of our approach is the possibility to apply off-the-shelf DNA analysis techniques to study online users behaviors and to efficiently rely on a limited number of lightweight account characteristics.

Submitted 13 March, 2017; originally announced March 2017.

Journal ref: IEEE Transactions on Dependable and Secure Computing 15(4):561-576, 2018

arXiv:1701.03017 [pdf, other]

doi 10.1145/3041021.3055135

The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race

Authors: Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi

Abstract: Recent studies in social media spam and automation provide anecdotal argumentation of the rise of a new generation of spambots, so-called social spambots. Here, for the first time, we extensively study this novel phenomenon on Twitter and we provide quantitative evidence that a paradigm-shift exists in spambot design. First, we measure current Twitter's capabilities of detecting the new social spambots. Later, we assess the human performance in discriminating between genuine accounts, social spambots, and traditional spambots. Then, we benchmark several state-of-the-art techniques proposed by the academic literature. Results show that neither Twitter, nor humans, nor cutting-edge applications are currently capable of accurately detecting the new social spambots. Our results call for new approaches capable of turning the tide in the fight against this rising phenomenon. We conclude by reviewing the latest literature on spambots detection and we highlight an emerging common research trend based on the analysis of collective behaviors. Insights derived from both our extensive experimental campaign and survey shed light on the most promising directions of research and lay the foundations for the arms race against the novel social spambots. Finally, to foster research on this novel phenomenon, we make publicly available to the scientific community all the datasets used in this study.

Submitted 30 March, 2023; v1 submitted 11 January, 2017; originally announced January 2017.

Comments: Post-print of the article published in the Proceedings of 26th WWW, 2017, Companion Volume (Web Science Track, Perth, Australia, 3-7 April, 2017)

Journal ref: Proceedings of the 26th International Conference on World Wide Web Companion, 2017

arXiv:1609.06577 [pdf, other]

Semi-supervised knowledge extraction for detection of drugs and their effects

Authors: Fabio Del Vigna, Marinella Petrocchi, Alessandro Tommasi, Cesare Zavattari, Maurizio Tesconi

Abstract: New Psychoactive Substances (NPS) are drugs that lie in a grey area of legislation, since they are not internationally and officially banned, possibly leading to their not prosecutable trade. The exacerbation of the phenomenon is that NPS can be easily sold and bought online. Here, we consider large corpora of textual posts, published on online forums specialized on drug discussions, plus a small set of known substances and associated effects, which we call seeds. We propose a semi-supervised approach to knowledge extraction, applied to the detection of drugs (comprising NPS) and effects from the corpora under investigation. Based on the very small set of initial seeds, the work highlights how a contrastive approach and context deduction are effective in detecting substances and effects from the corpora. Our promising results, which feature an F1 score close to 0.9, pave the way for shortening the detection time of new psychoactive substances, once these are discussed and advertised on the Internet.

Submitted 21 September, 2016; originally announced September 2016.

Comments: 14 pages excluding references

arXiv:1605.03817 [pdf, other]

Spotting the diffusion of New Psychoactive Substances over the Internet

Authors: Fabio Del Vigna, Marco Avvenuti, Clara Bacciu, Paolo Deluca, Andrea Marchetti, Marinella Petrocchi, Maurizio Tesconi

Abstract: Online availability and diffusion of New Psychoactive Substances (NPS) represent an emerging threat to healthcare systems. In this work, we analyse drugs forums, online shops, and Twitter. By mining the data from these sources, it is possible to understand the dynamics of drugs diffusion and their endorsement, as well as timely detecting new substances. We propose a set of visual analytics tools to support analysts in tackling NPS spreading and provide a better insight about drugs market and analysis.

Submitted 11 July, 2016; v1 submitted 12 May, 2016; originally announced May 2016.

arXiv:1603.01987 [pdf, other]

A matter of words: NLP for quality evaluation of Wikipedia medical articles

Authors: Vittoria Cozza, Marinella Petrocchi, Angelo Spognardi

Abstract: Automatic quality evaluation of Web information is a task with many fields of applications and of great relevance, especially in critical domains like the medical one. We move from the intuition that the quality of content of medical Web documents is affected by features related with the specific domain. First, the usage of a specific vocabulary (Domain Informativeness); then, the adoption of specific codes (like those used in the infoboxes of Wikipedia articles) and the type of document (e.g., historical and technical ones). In this paper, we propose to leverage specific domain features to improve the results of the evaluation of Wikipedia medical articles. In particular, we evaluate the articles adopting an "actionable" model, whose features are related to the content of the articles, so that the model can also directly suggest strategies for improving a given article quality. We rely on Natural Language Processing (NLP) and dictionaries-based techniques in order to extract the bio-medical concepts in a text. We prove the effectiveness of our approach by classifying the medical articles of the Wikipedia Medicine Portal, which have been previously manually labeled by the Wiki Project team. The results of our experiments confirm that, by considering domain-oriented features, it is possible to obtain sensible improvements with respect to existing solutions, mainly for those articles that other approaches have less correctly classified.
Other than being interesting on their own, the results call for further research in the area of domain specific features suitable for Web data quality assessment.

Submitted 7 March, 2016; originally announced March 2016.

arXiv:1602.00110 [pdf, other]

doi 10.1109/MIS.2016.29

DNA-inspired online behavioral modeling and its application to spambot detection

Authors: Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi

Abstract: We propose a strikingly novel, simple, and effective approach to model online user behavior: we extract and analyze digital DNA sequences from user online actions and we use Twitter as a benchmark to test our proposal. We obtain an incisive and compact DNA-inspired characterization of user actions. Then, we apply standard DNA analysis techniques to discriminate between genuine and spambot accounts on Twitter. An experimental campaign supports our proposal, showing its effectiveness and viability. To the best of our knowledge, we are the first ones to identify and adapt DNA-inspired techniques to online user behavioral modeling. While Twitter spambot detection is a specific use case on a specific social media, our proposed methodology is platform and technology agnostic, hence paving the way for diverse behavioral characterization tasks.

Submitted 30 January, 2016; originally announced February 2016.

ACM Class: H.2.8.d; I.2.4

Journal ref: IEEE Intelligent Systems 31(5):58-64, 2016

arXiv:1510.04031 [pdf, other]

doi 10.1109/WIFS.2015.7368607

TRAP: using TaRgeted Ads to unveil Google personal Profiles

Authors: Mauro Conti, Vittoria Cozza, Marinella Petrocchi, Angelo Spognardi

Abstract: In the last decade, the advertisement market spread significantly in the web and mobile app system. Its effectiveness is also due to the possibility to target the advertisement on the specific interests of the actual user, other than on the content of the website hosting the advertisement. In this scenario, services that collect and hence can provide information about the browsing user, like Facebook and Google, became of great value. In this paper, we show how to maliciously exploit the Google Targeted Advertising system to infer personal information in Google user profiles. In particular, the attack we consider is external from Google and relies on combining data from Google AdWords with other data collected from a website of the Google Display Network. We validate the effectiveness of our proposed attack, also discussing possible application scenarios. The result of our research shows a significant practical privacy issue behind such type of targeted advertising service, and calls for further investigation and the design of more privacy-aware solutions, possibly without impeding the current business model involved in online advertisement.

Submitted 14 October, 2015; originally announced October 2015.

Comments: 7th IEEE International Workshop on Information Forensics and Security (WIFS) 2015. 6 pages

arXiv:1509.04098 [pdf, ps, other]

doi 10.1016/j.dss.2015.09.003

Fame for sale: efficient detection of fake Twitter followers

Authors: Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, Maurizio Tesconi

Abstract: $\textit{Fake followers}$ are those Twitter accounts specifically created to inflate the number of followers of a target account. Fake followers are dangerous for the social platform and beyond, since they may alter concepts like popularity and influence in the Twittersphere - hence impacting on economy, politics, and society. In this paper, we contribute along different dimensions. First, we review some of the most relevant existing features and rules (proposed by Academia and Media) for anomalous Twitter accounts detection. Second, we create a baseline dataset of verified human and fake follower accounts. Such baseline dataset is publicly available to the scientific community. Then, we exploit the baseline dataset to train a set of machine-learning classifiers built over the reviewed rules and features. Our results show that most of the rules proposed by Media provide unsatisfactory performance in revealing fake followers, while features proposed in the past by Academia for spam detection provide good results. Building on the most promising features, we revise the classifiers both in terms of reduction of overfitting and cost for gathering the data needed to compute the features. The final result is a novel $\textit{Class A}$ classifier, general enough to thwart overfitting, lightweight thanks to the usage of the less costly features, and still able to correctly classify more than 95% of the accounts of the original training set. We ultimately perform an information fusion-based sensitivity analysis, to assess the global sensitivity of each of the features employed by the classifier. The findings reported in this paper, other than being supported by a thorough experimental methodology and interesting on their own, also pave the way for further investigation on the novel issue of fake Twitter followers.

Submitted 10 November, 2015; v1 submitted 14 September, 2015; originally announced September 2015.

ACM Class: H.2.8

Journal ref: Decision Support Systems, 80, 56-71, 2015

arXiv:1508.03902 [pdf, other]

doi 10.4204/EPTCS.188.6

Domain-specific queries and Web search personalization: some investigations

Authors: Van Tien Hoang, Angelo Spognardi, Francesco Tiezzi, Marinella Petrocchi, Rocco De Nicola

Abstract: Major search engines deploy personalized Web results to enhance users' experience, by showing them data supposed to be relevant to their interests. Even if this process may bring benefits to users while browsing, it also raises concerns on the selection of the search results. In particular, users may be unknowingly trapped by search engines in protective information bubbles, called "filter bubbles", which can have the undesired effect of separating users from information that does not fit their preferences. This paper moves from early results on quantification of personalization over Google search query results. Inspired by previous works, we have carried out some experiments consisting of search queries performed by a battery of Google accounts with differently prepared profiles. Matching query results, we quantify the level of personalization, according to topics of the queries and the profile of the accounts. This work reports initial results and it is a first step for a more extensive investigation to measure Web search personalization.

Submitted 16 August, 2015; originally announced August 2015.

Comments: In Proceedings WWV 2015, arXiv:1508.03389

Journal ref: EPTCS 188, 2015, pp. 51-58

Showing 1–28 of 28 results for author: Petrocchi, M