-
Identifying Offensive Expressions of Opinion in Context
Authors:
Francielle Alves Vargas,
Isabelle Carvalho,
Fabiana Rodrigues de Góes
Abstract:
Classic information extraction techniques consist in building questions and answers about the facts. Indeed, it is still a challenge to subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources to information extraction, above all offensive or hateful opinions in context. To fill this important gap, this short pape…
▽ More
Classic information extraction techniques consist in building questions and answers about the facts. Indeed, it is still a challenge to subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources to information extraction, above all offensive or hateful opinions in context. To fill this important gap, this short paper provides a new cross-lingual and contextual offensive lexicon, which consists of explicit and implicit offensive and swearing expressions of opinion, which were annotated in two different classes: context dependent and context-independent offensive. In addition, we provide markers to identify hate speech. Annotation approach was evaluated at the expression-level and achieves high human inter-annotator agreement. The provided offensive lexicon is available in Portuguese and English languages.
△ Less
Submitted 21 September, 2022; v1 submitted 25 April, 2021;
originally announced April 2021.
-
HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Authors:
Francielle Alves Vargas,
Isabelle Carvalho,
Fabiana Rodrigues de Góes,
Fabrício Benevenuto,
Thiago Alexandre Salgueiro Pardo
Abstract:
Due to the severity of the social media offensive and hateful comments in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians' accounts on Instagram and manually annotated by…
▽ More
Due to the severity of the social media offensive and hateful comments in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians' accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
△ Less
Submitted 27 December, 2022; v1 submitted 27 March, 2021;
originally announced March 2021.
-
Studying Dishonest Intentions in Brazilian Portuguese Texts
Authors:
Francielle Alves Vargas,
Thiago Alexandre Salgueiro Pardo
Abstract:
Previous work in the social sciences, psychology and linguistics has show that liars have some control over the content of their stories, however their underlying state of mind may "leak out" through the way that they tell them. To the best of our knowledge, no previous systematic effort exists in order to describe and model deception language for Brazilian Portuguese. To fill this important gap,…
▽ More
Previous work in the social sciences, psychology and linguistics has show that liars have some control over the content of their stories, however their underlying state of mind may "leak out" through the way that they tell them. To the best of our knowledge, no previous systematic effort exists in order to describe and model deception language for Brazilian Portuguese. To fill this important gap, we carry out an initial empirical linguistic study on false statements in Brazilian news. We methodically analyze linguistic features using a deceptive news corpus, which includes both fake and true news. The results show that they present substantial lexical, syntactic and semantic variations, as well as punctuation and emotion distinctions.
△ Less
Submitted 1 April, 2021; v1 submitted 13 August, 2020;
originally announced August 2020.