-
Contextual-Lexicon Approach for Abusive Language Detection
Authors:
Francielle Vargas,
Fabiana Rodrigues de Góes,
Isabelle Carvalho,
Fabrício Benevenuto,
Thiago Alexandre Salgueiro Pardo
Abstract:
Since a lexicon-based approach is more elegant scientifically, explaining the solution components and being easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media. Our approach embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the sever…
▽ More
Since a lexicon-based approach is more elegant scientifically, explaining the solution components and being easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media. Our approach embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of the social media abusive comments in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.
△ Less
Submitted 20 December, 2022; v1 submitted 25 April, 2021;
originally announced April 2021.
-
Identifying Offensive Expressions of Opinion in Context
Authors:
Francielle Alves Vargas,
Isabelle Carvalho,
Fabiana Rodrigues de Góes
Abstract:
Classic information extraction techniques consist in building questions and answers about the facts. Indeed, it is still a challenge to subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources to information extraction, above all offensive or hateful opinions in context. To fill this important gap, this short pape…
▽ More
Classic information extraction techniques consist in building questions and answers about the facts. Indeed, it is still a challenge to subjective information extraction systems to identify opinions and feelings in context. In sentiment-based NLP tasks, there are few resources to information extraction, above all offensive or hateful opinions in context. To fill this important gap, this short paper provides a new cross-lingual and contextual offensive lexicon, which consists of explicit and implicit offensive and swearing expressions of opinion, which were annotated in two different classes: context dependent and context-independent offensive. In addition, we provide markers to identify hate speech. Annotation approach was evaluated at the expression-level and achieves high human inter-annotator agreement. The provided offensive lexicon is available in Portuguese and English languages.
△ Less
Submitted 21 September, 2022; v1 submitted 25 April, 2021;
originally announced April 2021.
-
HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Authors:
Francielle Alves Vargas,
Isabelle Carvalho,
Fabiana Rodrigues de Góes,
Fabrício Benevenuto,
Thiago Alexandre Salgueiro Pardo
Abstract:
Due to the severity of the social media offensive and hateful comments in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians' accounts on Instagram and manually annotated by…
▽ More
Due to the severity of the social media offensive and hateful comments in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection. The HateBR corpus was collected from the comment section of Brazilian politicians' accounts on Instagram and manually annotated by specialists, reaching a high inter-annotator agreement. The corpus consists of 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). We also implemented baseline experiments for offensive language and hate speech detection and compared them with a literature baseline. Results show that the baseline experiments on our corpus outperform the current state-of-the-art for the Portuguese language.
△ Less
Submitted 27 December, 2022; v1 submitted 27 March, 2021;
originally announced March 2021.