-
Close to Human-Level Agreement: Tracing Journeys of Violent Speech in Incel Posts with GPT-4-Enhanced Annotations
Authors:
Daniel Matter,
Miriam Schirmer,
Nir Grinberg,
Jürgen Pfeffer
Abstract:
This study investigates the prevalence of violent language on incels.is. It evaluates GPT models (GPT-3.5 and GPT-4) for content analysis in social sciences, focusing on the impact of varying prompts and batch sizes on coding quality for the detection of violent speech. We scraped over 6.9M posts from incels.is and categorized a random sample into non-violent, explicitly violent, and implicitly violent content. Two human coders annotated 3,028 posts, which we used to tune and evaluate GPT-3.5 and GPT-4 models across different prompts and batch sizes regarding coding reliability. The best-performing GPT-4 model annotated an additional 30,000 posts for further analysis.
Our findings indicate an overall increase in violent speech over time on incels.is, both at the community and individual level, particularly among more engaged users. While directed violent language decreases, non-directed violent language increases, and self-harm content shows a decline, especially after 2.5 years of user activity. We find substantial agreement between both human coders (K = 0.65), while the best GPT-4 model yields good agreement with both human coders (K = 0.54 for Human A and K = 0.62 for Human B). Weighted and macro F1 scores further support this alignment.
Overall, this research provides practical means for accurately identifying violent language at a large scale that can aid content moderation and facilitate next-step research into the causal mechanism and potential mitigations of violent expression and radicalization in communities like incels.is.
Submitted 3 January, 2024;
originally announced January 2024.
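The agreement values (K) reported in the abstract above can be illustrated with a small sketch. It assumes that K denotes Cohen's kappa between two annotators, which the abstract does not state explicitly. The function name `cohen_kappa` and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (illustrative only, not data from the paper).
human = ["non", "non", "explicit", "implicit", "non", "explicit"]
model = ["non", "non", "explicit", "non", "non", "implicit"]
kappa = cohen_kappa(human, model)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one category (here, non-violent posts) dominates the sample.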
-
SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification
Authors:
Difan Jiao,
Yilun Liu,
Zhenwei Tang,
Daniel Matter,
Jürgen Pfeffer,
Ashton Anderson
Abstract:
Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. Current text classification paradigms, however, rely solely on the output of the final layer in the LLM, with the rich information contained in internal neurons largely untapped. In this study, we present SPIN: a model-agnostic framework that sparsifies and integrates internal neurons of intermediate layers of LLMs for text classification. Specifically, SPIN sparsifies internal neurons by linear probing-based salient neuron selection layer by layer, avoiding noise from unrelated neurons and ensuring efficiency. The cross-layer salient neurons are then integrated to serve as multi-layered features for the classification head. Extensive experimental results show our proposed SPIN significantly improves text classification accuracy, efficiency, and interpretability.
Submitted 5 June, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Temporally Stable Multilayer Network Embeddings: A Longitudinal Study of Russian Propaganda
Authors:
Daniel Matter,
Elizaveta Kuznetsova,
Victoria Vziatysheva,
Ilaria Vitulano,
Juergen Pfeffer
Abstract:
Russian propaganda outlet RT (formerly, Russia Today) produces content in seven languages. There is ample evidence that RT's communication techniques differ for different language audiences. In this article, we offer the first comprehensive analysis of RT's multi-lingual article collection, analyzing all 2.4 million articles available on the online platform from 2006 until June 2023. Annual semantic networks are created from the co-occurrence of the articles' tags. Within one language, we use AlignedUMAP to get stable inter-temporal embeddings. Between languages, we propose a new method to align multiple, sparsely connected networks in an intermediate representation before projecting them into the final embedding space. With respect to RT's communication strategy, our findings hint at a lack of a coherent strategy in RT's targeting of audiences in different languages, evident through differences in tag usage, clustering patterns, and uneven shifts in the prioritization of themes within language versions. Although identified clusters of tags align with the key themes in Russian propaganda, such as Ukraine, foreign affairs, Western countries, and the Middle East, we observe significant differences across languages in the attention given to specific issues; these differences appear to react to the information environment rather than reflect a cohesive approach.
Submitted 17 July, 2023;
originally announced July 2023.
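The annual tag co-occurrence networks mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming each article is available as a `(year, tags)` pair and that an edge weight counts the articles in which two tags appear together. The abstract does not specify the weighting, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def annual_cooccurrence(articles):
    """Build one weighted tag co-occurrence edge list per year.

    articles: iterable of (year, tags) pairs.
    Returns a dict mapping year -> Counter of (tag_a, tag_b) -> article count,
    with each tag pair stored in sorted order.
    """
    networks = defaultdict(Counter)
    for year, tags in articles:
        # Deduplicate and sort so each unordered pair has a single key.
        for tag_a, tag_b in combinations(sorted(set(tags)), 2):
            networks[year][(tag_a, tag_b)] += 1
    return networks


# Made-up sample articles (not data from the paper).
sample = [
    (2022, ["ukraine", "nato"]),
    (2022, ["nato", "ukraine", "usa"]),
    (2023, ["syria", "usa"]),
]
networks = annual_cooccurrence(sample)
```

The resulting per-year edge lists are the kind of input that a graph or embedding library would then consume; the alignment and embedding steps described in the abstract are not covered by this sketch.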
-
Synchronization of quantum communication over an optical classical communication channel
Authors:
Federico Berra,
Costantino Agnesi,
Andrea Stanco,
Marco Avesani,
Michal Kuklewski,
Daniel Matter,
Paolo Villoresi,
Giuseppe Vallone
Abstract:
Precise synchronization between transmitter and receiver is crucial for quantum communication protocols, such as Quantum Key Distribution (QKD), to efficiently correlate the transmitted and received signals and increase the signal-to-noise ratio. In this work, we introduce a synchronization technique that exploits a co-propagating classical optical communication link and test its performance in a free-space QKD system. Existing techniques require additional laser beams or rely on retrieving the synchronization from the quantum signal itself, which is not applicable in high channel loss scenarios. In contrast, our method exploits classical and quantum signals locked to the same master clock, allowing the receiver to synchronize both the classical and quantum communication links by performing a clock-data-recovery routine on the classical signal. In this way, by exploiting the same classical communication already required for post-processing and key generation, no additional hardware is required, and the synchronization can be reconstructed from a high-power signal. Our approach is suitable for both satellite and fiber infrastructures, where a classical and quantum channel can be transmitted through the same link.
Submitted 30 June, 2023;
originally announced June 2023.
-
The Half-Life of a Tweet
Authors:
Juergen Pfeffer,
Daniel Matter,
Anahit Sargsyan
Abstract:
Twitter has started to share an impression_count variable as part of the available public metrics for every Tweet collected with Twitter's APIs. With the information about how often a particular Tweet has been shown to Twitter users at the time of data collection, we can learn important insights about the dissemination process of a Tweet by measuring its impression count repeatedly over time. With our preliminary analysis, we can show that on average the peak of impressions per second is 72 seconds after a Tweet was sent and that after 24 hours, no relevant number of impressions can be observed for ~95% of all Tweets. Finally, we estimate that the median half-life of a Tweet, i.e. the time it takes before half of all impressions are created, is about 80 minutes.
Submitted 11 April, 2023; v1 submitted 19 February, 2023;
originally announced February 2023.
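The half-life definition in the abstract above (the time by which half of all impressions have been created) can be illustrated with a short sketch. It assumes cumulative impression counts sampled at increasing times, zero impressions at time zero, the last measurement as a stand-in for the total, and linear interpolation between samples. The function name and numbers are hypothetical, and this is not necessarily the estimation procedure used in the paper.

```python
def half_life(times, counts):
    """Estimate when cumulative impressions reach half of the final count.

    times:  seconds since the Tweet was sent, strictly increasing.
    counts: cumulative impression counts at those times, non-decreasing.
    """
    if len(times) != len(counts) or not times:
        raise ValueError("times and counts must be non-empty and equally long")
    total = counts[-1]
    if total <= 0:
        raise ValueError("final impression count must be positive")
    target = total / 2
    prev_t, prev_c = 0.0, 0.0  # assumption: no impressions at time zero
    for t, c in zip(times, counts):
        if c >= target:
            # Linear interpolation between the two surrounding measurements.
            return prev_t + (target - prev_c) * (t - prev_t) / (c - prev_c)
        prev_t, prev_c = t, c


# Made-up measurements at 1 minute, 10 minutes, 1 hour, and 24 hours.
sample_times = [60, 600, 3600, 86400]
sample_counts = [10, 40, 80, 100]
estimate = half_life(sample_times, sample_counts)  # 1350.0 seconds
```

Denser sampling shortly after posting would tighten the estimate, since most impressions accrue early according to the abstract.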
-
Just Another Day on Twitter: A Complete 24 Hours of Twitter Data
Authors:
Juergen Pfeffer,
Daniel Matter,
Kokil Jaidka,
Onur Varol,
Afra Mashhadi,
Jana Lasser,
Dennis Assenmacher,
Siqi Wu,
Diyi Yang,
Cornelia Brantner,
Daniel M. Romero,
Jahna Otterbacher,
Carsten Schwemmer,
Kenneth Joseph,
David Garcia,
Fred Morstatter
Abstract:
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
Submitted 11 April, 2023; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Asteroids' physical models from combined dense and sparse photometry and scaling of the YORP effect by the observed obliquity distribution
Authors:
J. Hanuš,
J. Ďurech,
M. Brož,
A. Marciniak,
B. D. Warner,
F. Pilcher,
R. Stephens,
R. Behrend,
B. Carry,
D. Čapek,
P. Antonini,
M. Audejean,
K. Augustesen,
E. Barbotin,
P. Baudouin,
A. Bayol,
L. Bernasconi,
W. Borczyk,
J. -G. Bosch,
E. Brochard,
L. Brunetto,
S. Casulli,
A. Cazenave,
S. Charbonnel,
B. Christophe
, et al. (95 additional authors not shown)
Abstract:
The larger number of models of asteroid shapes and their rotational states derived by the lightcurve inversion gives us better insight into both the nature of individual objects and the whole asteroid population. With a larger statistical sample we can study the physical properties of asteroid populations, such as main-belt asteroids or individual asteroid families, in more detail. Shape models can also be used in combination with other types of observational data (IR, adaptive optics images, stellar occultations), e.g., to determine sizes and thermal properties. We use all available photometric data of asteroids to derive their physical models by the lightcurve inversion method and compare the observed pole latitude distributions of all asteroids with known convex shape models with the simulated pole latitude distributions. We used classical dense photometric lightcurves from several sources and sparse-in-time photometry from the U.S. Naval Observatory in Flagstaff, Catalina Sky Survey, and La Palma surveys (IAU codes 689, 703, 950) in the lightcurve inversion method to determine asteroid convex models and their rotational states. We also extended a simple dynamical model for the spin evolution of asteroids used in our previous paper. We present 119 new asteroid models derived from combined dense and sparse-in-time photometry. We discuss the reliability of asteroid shape models derived only from Catalina Sky Survey data (IAU code 703) and present 20 such models. By using different values for a scaling parameter c_YORP (which corresponds to the magnitude of the YORP momentum) in the dynamical model for the spin evolution and by comparing synthetic and observed pole-latitude distributions, we were able to constrain the typical values of the c_YORP parameter to between 0.05 and 0.6.
Submitted 29 January, 2013;
originally announced January 2013.