-
Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation
Authors:
Damián Ariel Furman,
Juan Junqueras,
Z. Burçe Gümüslü,
Edgar Altszyler,
Joaquin Navajas,
Ophelia Deroy,
Justin Sulik
Abstract:
We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT4 and GPT3.5-Turbo. We show how it is possible to mine these reasons in non-structured text, under different task definitions, despite the high level of subjectivity involved and explore the impact of artificially augmented data using in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset and the trained models along with the annotation manual used to train annotators and define the task.
Submitted 28 June, 2024;
originally announced June 2024.
-
Which Argumentative Aspects of Hate Speech in Social Media can be reliably identified?
Authors:
Damián Furman,
Pablo Torres,
José A. Rodríguez,
Diego Letzen,
Vanina Martínez,
Laura Alonso Alemany
Abstract:
With the increasing diversity of use cases of large language models, a more informative treatment of texts seems necessary. An argumentative analysis could foster a more reasoned usage of chatbots, text completion mechanisms or other applications. However, it is unclear which aspects of argumentation can be reliably identified and integrated in language models. In this paper, we present an empirical assessment of the reliability with which different argumentative aspects can be automatically identified in hate speech in social media. We have enriched the Hateval corpus (Basile et al. 2019) with a manual annotation of some argumentative components, adapted from Wagemans (2016)'s Periodic Table of Arguments. We show that some components can be identified with reasonable reliability. For those that present a high error ratio, we analyze the patterns of disagreement between expert annotators and errors in automatic procedures, and we propose adaptations of those categories that can be more reliably reproduced.
Submitted 5 June, 2023;
originally announced June 2023.
-
Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models
Authors:
Tim Schott,
Daniel Furman,
Shreshta Bhat
Abstract:
In this work, we assess the ability of foundation models to recall encyclopedic knowledge across a wide range of linguistic contexts. To support this, we: 1) produce a 20-language dataset that contains 303k factual associations paired with counterfactuals, 2) evaluate 5 models in a multilingual test, and 3) benchmark a diverse set of 24 models in an English-only test. Meta's LLaMA achieves the highest scores in both multilingual and English-only evaluations. Yet, an analysis of LLaMA's errors reveals significant limitations in its ability to recall facts in languages other than English, plus difficulties related to the location and gender of fact subjects. Overall, our findings suggest that today's foundation models are far from polyglots.
Submitted 5 December, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Quantum Sparse Coding
Authors:
Yaniv Romano,
Harel Primack,
Talya Vaknin,
Idan Meirzada,
Ilan Karpas,
Dov Furman,
Chene Tradonsky,
Ruti Ben Shlomi
Abstract:
The ultimate goal of any sparse coding method is to accurately recover an unknown sparse vector from a few noisy linear measurements. Unfortunately, this estimation problem is NP-hard in general, and it is therefore always approached with an approximation method, such as lasso or orthogonal matching pursuit, thus trading off accuracy for lower computational complexity. In this paper, we develop a quantum-inspired algorithm for sparse coding, with the premise that the emergence of quantum computers and Ising machines can potentially lead to more accurate estimations compared to classical approximation methods. To this end, we formulate the most general sparse coding problem as a quadratic unconstrained binary optimization (QUBO) task, which can be efficiently minimized using quantum technology. To arrive at a QUBO model that is also efficient in terms of the number of spins (space complexity), we separate our analysis into three different scenarios. These are defined by the number of bits required to express the underlying sparse vector: binary, 2-bit, and a general fixed-point representation. We conduct numerical experiments with simulated data on LightSolver's quantum-inspired digital platform to verify the correctness of our QUBO formulation and to demonstrate its advantage over baseline methods.
Submitted 8 September, 2022;
originally announced September 2022.
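As a rough illustration of the binary scenario mentioned in the Quantum Sparse Coding abstract above, the sketch below casts a small sparse coding problem as a QUBO. It is not the paper's formulation, which the abstract does not spell out. It assumes the objective ||y - A x||^2 + lam * sum(x) with x restricted to {0, 1}, and uses the identity x_i^2 = x_i to fold the linear terms into the diagonal of Q. The function names and the brute-force minimizer (a stand-in for an Ising machine or quantum-inspired solver) are hypothetical.

```python
from itertools import product


def build_qubo(A, y, lam):
    """Return (Q, offset) so that x^T Q x + offset == ||y - A x||^2 + lam * sum(x)
    for every binary vector x. Assumed textbook expansion, not the paper's model."""
    m, n = len(A), len(A[0])
    # Gram matrix G = A^T A and correlation vector c = A^T y
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    c = [sum(A[k][i] * y[k] for k in range(m)) for i in range(n)]
    Q = [row[:] for row in G]
    for i in range(n):
        # x_i^2 == x_i for binary x, so linear terms move onto the diagonal
        Q[i][i] += lam - 2.0 * c[i]
    offset = sum(v * v for v in y)  # constant term y^T y
    return Q, offset


def qubo_energy(Q, x):
    """Evaluate x^T Q x for a binary tuple x."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def objective(A, y, lam, x):
    """Direct evaluation of ||y - A x||^2 + lam * sum(x)."""
    residual = [y[k] - sum(A[k][i] * x[i] for i in range(len(x)))
                for k in range(len(A))]
    return sum(r * r for r in residual) + lam * sum(x)


def brute_force_minimize(Q):
    """Exhaustive search over all 2^n binary vectors (only viable for tiny n)."""
    n = len(Q)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))


# Toy, noise-free example: y is generated from a known binary sparse vector.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
x_true = (1, 0, 1)
y = [sum(A[k][i] * x_true[i] for i in range(3)) for k in range(3)]
Q, offset = build_qubo(A, y, lam=0.1)
x_hat = brute_force_minimize(Q)
```

Exhaustive search scales as 2^n, which is the cost that dedicated QUBO solvers aim to avoid; it is used here only to check that the QUBO energy and the original objective agree on a problem small enough to enumerate.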
-
A Spanish dataset for Targeted Sentiment Analysis of political headlines
Authors:
Tomás Alves Salgueiro,
Emilio Recart Zapata,
Damián Furman,
Juan Manuel Pérez,
Pablo Nicolás Fernández Larrosa
Abstract:
Subjective texts have been studied by several works as they can induce certain behaviours in their users. Most work focuses on user-generated texts in social networks, but some other texts also comprise opinions on certain topics and could influence judgement criteria during political decisions. In this work, we address the task of Targeted Sentiment Analysis for the domain of news headlines, published by the main outlets during the 2019 Argentinean Presidential Elections. For this purpose, we present a polarity dataset of 1,976 headlines mentioning candidates in the 2019 elections at the target level. Preliminary experiments with state-of-the-art classification algorithms based on pre-trained linguistic models suggest that target information is helpful for this task. We make our data and pre-trained models publicly available.
Submitted 29 August, 2022;
originally announced August 2022.
-
Parsimonious Argument Annotations for Hate Speech Counter-narratives
Authors:
Damian A. Furman,
Pablo Torres,
Jose A. Rodriguez,
Lautaro Martinez,
Laura Alonso Alemany,
Diego Letzen,
Maria Vanina Martinez
Abstract:
We present an enrichment of the Hateval corpus of hate speech tweets (Basile et al. 2019) aimed to facilitate automated counter-narrative generation. Comparably to previous work (Chung et al. 2019), manually written counter-narratives are associated with tweets. However, this information alone seems insufficient to obtain satisfactory language models for counter-narrative generation. That is why we have also annotated tweets with argumentative information based on Wagemans (2016), which we believe can help in building convincing and effective counter-narratives for hate speech against particular groups.
We discuss adequacies and difficulties of this annotation process and present several baselines for automatic detection of the annotated elements. Preliminary results show that automatic annotators perform close to human annotators in detecting some aspects of argumentation, while others only reach low or moderate levels of inter-annotator agreement.
Submitted 1 August, 2022;
originally announced August 2022.
-
LightSolver -- A New Quantum-inspired Solver Cracks the 3-Regular 3-XORSAT Challenge
Authors:
Idan Meirzada,
Assaf Kalinski,
Dov Furman,
Tsafrir Armon,
Talya Vaknin,
Harel Primack,
Chene Tradonsky,
Ruti Ben-Shlomi
Abstract:
The increasing complexity of required computational tasks alongside the inherent limitations in conventional computing calls for disruptive innovation. LightSolver devised a new quantum-inspired computing paradigm, which utilizes an all-optical platform for solving hard optimization problems. In this work, LightSolver introduces its digital simulator and joins the 3-Regular 3-XORSAT (3R3X) challenge, which aims to map the best available state-of-the-art classical and quantum solvers. So far, the challenge has resulted in a clear exponential barrier in terms of time-to-solution (TTS), preventing the inspected platforms from solving problems larger than a few hundred variables. LightSolver's simulator is the first to break the exponential barrier, outperforming both classical and quantum platforms by several orders of magnitude and extending the maximal problem size to more than 16,000 variables.
Submitted 19 July, 2022;
originally announced July 2022.
-
RoBERTuito: a pre-trained language model for social media text in Spanish
Authors:
Juan Manuel Pérez,
Damián A. Furman,
Laura Alonso Alemany,
Franco Luque
Abstract:
Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for Natural Language Understanding tasks. Recently, some works have focused on pre-training specially crafted models for particular domains, such as scientific papers, medical documents, and user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks. However, for languages other than English such models are not widely available.
In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model achieves top results for some English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also performs competitively against monolingual models in English tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.
Submitted 4 May, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks
Authors:
Juan Manuel Pérez,
Mariela Rajngewerc,
Juan Carlos Giudici,
Damián A. Furman,
Franco Luque,
Laura Alonso Alemany,
María Vanina Martínez
Abstract:
In recent years, the extraction of opinions and information from user-generated text has attracted a lot of interest, largely due to the unprecedented volume of content in Social Media. However, social researchers face some issues in adopting cutting-edge tools for these tasks, as they are usually behind commercial APIs, unavailable for languages other than English, or very complex to use for non-experts. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish, English, Italian, and Portuguese in an easy-to-use Python library, allowing researchers to leverage these techniques. We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.
△ Less
Submitted 25 October, 2023; v1 submitted 17 June, 2021;
originally announced June 2021.
-
Magnetic ordering and magnetodielectric phenomena in CoSeO$_4$
Authors:
Brent C. Melot,
Lucy E. Darago,
Ram Seshadri,
Abby Goldman,
Joshua D. Furman,
Efrain E. Rodriguez
Abstract:
CoSeO$_4$ has a structure consisting of edge-sharing chains of Co$^{2+}$ octahedra which are held together by SeO$_4^{2-}$ tetrahedra via shared oxygen atoms at the edges of the octahedra. DC magnetization measurements indicate a transition to an ordered state below 30 K. Powder neutron diffraction refinements suggest an ordered state with two unique antiferromagnetic chains within the unit cell. Isothermal magnetization measurements indicate a temperature-dependent field-induced magnetic transition below the ordering temperature. From neutron diffraction, we find this corresponds to a realignment of spins from the canted configuration towards the c-axis. The dielectric constant shows a change in slope at the magnetic ordering temperature as well as a quadratic dependence on the external magnetic field.
Submitted 18 March, 2010;
originally announced March 2010.