-
Predicting Sentence-Level Factuality of News and Bias of Media Outlets
Authors:
Francielle Vargas,
Kokil Jaidka,
Thiago A. S. Pardo,
Fabrício Benevenuto
Abstract:
Automated news credibility and fact-checking at scale require accurately predicting news factuality and media bias. This paper introduces a large sentence-level dataset, titled "FactNews", composed of 6,191 sentences expertly annotated according to factuality and media bias definitions proposed by AllSides. We use FactNews to assess the overall reliability of news sources, by formulating two text classification problems for predicting sentence-level factuality of news reporting and bias of media outlets. Our experiments demonstrate that biased sentences contain more words than factual sentences and show a predominance of emotions. Hence, the fine-grained analysis of subjectivity and impartiality of news articles provided promising results for predicting the reliability of media outlets. Finally, due to the severity of fake news and political polarization in Brazil, and the lack of research for Portuguese, both the dataset and the baseline were proposed for Brazilian Portuguese.
Submitted 28 June, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
Contextual-Lexicon Approach for Abusive Language Detection
Authors:
Francielle Vargas,
Fabiana Rodrigues de Góes,
Isabelle Carvalho,
Fabrício Benevenuto,
Thiago Alexandre Salgueiro Pardo
Abstract:
Since a lexicon-based approach is more elegant scientifically, explaining the solution components and being easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media. Our approach embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of abusive comments on social media in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.
Submitted 20 December, 2022; v1 submitted 25 April, 2021;
originally announced April 2021.
-
HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Authors:
Francielle Alves Vargas,
Isabelle Carvalho,
Fabiana Rodrigues de Góes,
Fabrício Benevenuto,
Thiago Alexandre Salgueiro Pardo
Abstract:
Due to the severity of offensive and hateful comments on social media in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert-annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians' accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
Submitted 27 December, 2022; v1 submitted 27 March, 2021;
originally announced March 2021.
-
Studying Dishonest Intentions in Brazilian Portuguese Texts
Authors:
Francielle Alves Vargas,
Thiago Alexandre Salgueiro Pardo
Abstract:
Previous work in the social sciences, psychology and linguistics has shown that liars have some control over the content of their stories; however, their underlying state of mind may "leak out" through the way that they tell them. To the best of our knowledge, no previous systematic effort exists to describe and model deception language for Brazilian Portuguese. To fill this important gap, we carry out an initial empirical linguistic study on false statements in Brazilian news. We methodically analyze linguistic features using a deceptive news corpus, which includes both fake and true news. The results show that they present substantial lexical, syntactic and semantic variations, as well as punctuation and emotion distinctions.
Submitted 1 April, 2021; v1 submitted 13 August, 2020;
originally announced August 2020.
-
SEMA: an Extended Semantic Evaluation Metric for AMR
Authors:
Rafael T. Anchieta,
Marco A. S. Cabezudo,
Thiago A. S. Pardo
Abstract:
Abstract Meaning Representation (AMR) is a recently designed semantic representation language intended to capture the meaning of a sentence, which may be represented as a single-rooted directed acyclic graph with labeled nodes and edges. The automatic evaluation of this structure plays an important role in the development of better systems, as well as for semantic annotation. Although one metric, smatch, is available, it has some drawbacks. For instance, smatch creates a self-relation on the root of the graph, has weights for different error types, and does not take into account the dependence of the elements in the AMR structure. With these drawbacks, smatch masks several problems of the AMR parsers and distorts the evaluation of the AMRs. In view of this, we introduce an extended metric to evaluate AMR parsers, which deals with the drawbacks of the smatch metric. Finally, we compare both metrics, using four well-known AMR parsers, and we argue that our metric is more refined, robust, fairer, and faster than smatch.
Submitted 28 May, 2019;
originally announced May 2019.