A Systematic Review on the Detection of Fake News Articles
Authors:
Nathaniel Hoy,
Theodora Koulouri
Abstract:
It has been argued that fake news and the spread of false information pose a threat to societies throughout the world, from influencing the results of elections to hindering the efforts to manage the COVID-19 pandemic. To combat this threat, a number of Natural Language Processing (NLP) approaches have been developed. These leverage a number of datasets, feature extraction/selection techniques and…
▽ More
It has been argued that fake news and the spread of false information pose a threat to societies throughout the world, from influencing the results of elections to hindering the efforts to manage the COVID-19 pandemic. To combat this threat, a number of Natural Language Processing (NLP) approaches have been developed. These leverage a number of datasets, feature extraction/selection techniques and machine learning (ML) algorithms to detect fake news before it spreads. While these methods are well-documented, there is less evidence regarding their efficacy in this domain. By systematically reviewing the literature, this paper aims to delineate the approaches for fake news detection that are most performant, identify limitations with existing approaches, and suggest ways these can be mitigated. The analysis of the results indicates that Ensemble Methods using a combination of news content and socially-based features are currently the most effective. Finally, it is proposed that future research should focus on develo** approaches that address generalisability issues (which, in part, arise from limitations with current datasets), explainability and bias.
△ Less
Submitted 18 October, 2021;
originally announced October 2021.
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Authors:
Steffen Herbold,
Alexander Trautsch,
Benjamin Ledel,
Alireza Aghamohammadi,
Taher Ahmed Ghaleb,
Kuljit Kaur Chahal,
Tim Bossenmaier,
Bhaveet Nagaria,
Philip Makedonski,
Matin Nili Ahmadabadi,
Kristof Szabados,
Helge Spieker,
Matej Madeja,
Nathaniel Hoy,
Valentina Lenarduzzi,
Shangwen Wang,
Gema Rodríguez-Pérez,
Ricardo Colomo-Palacios,
Roberto Verdecchia,
Paramvir Singh,
Yihao Qin,
Debasish Chakroborti,
Willard Davis,
Vijay Walunj,
Hongjun Wu
, et al. (23 additional authors not shown)
Abstract:
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Metho…
▽ More
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
△ Less
Submitted 13 October, 2021; v1 submitted 12 November, 2020;
originally announced November 2020.