-
Perceived Social Influence on Vaccination Decisions: A COVID-19 Case Study
Authors:
Denise Yewell,
R. Alexander Bentley,
Benjamin D. Horne
Abstract:
In this study, we examine the perceived influence of others, across both strong and weak social ties, on COVID-19 vaccination decisions in the United States. We add context to social influence by measuring related concepts, such as perceived agreement of others and perceived danger of COVID-19 to others. We find that vaccinated populations perceived more influence from their social circles than unvaccinated populations. This finding holds true across various social groups, including family, close friends, and neighbors. Vaccinated participants perceived that others agreed with their decision to get vaccinated more than unvaccinated participants perceived others to agree with their decision to not get vaccinated. Despite the clear differences in perceived social influence and agreement across the groups, the majority of participants across both vaccinated and unvaccinated populations perceived no social influence from all social groups in their decisions. Aligning with this result, we find through open-ended responses that both vaccinated and unvaccinated participants frequently cited fear as a motivating factor in their decision, rather than social influence: vaccinated participants feared COVID-19, while unvaccinated participants feared the vaccine itself.
Submitted 31 May, 2024; v1 submitted 1 April, 2024;
originally announced April 2024.
-
NELA-PS: A Dataset of Pink Slime News Articles for the Study of Local News Ecosystems
Authors:
Benjamin D. Horne,
Maurício Gruppi
Abstract:
Pink slime news outlets automatically produce low-quality, often partisan content that is framed as authentic local news. Given that local news is trusted by Americans and is increasingly shutting down due to financial distress, pink slime news outlets have the potential to exploit local information voids. Yet, there are gaps in understanding of pink slime production practices and tactics, particularly over time. Hence, to support future research in this area, we built a dataset of over 7.9M articles from 1093 pink slime sources over 2.5 years.
Submitted 20 March, 2024;
originally announced March 2024.
-
Embedding Elites: Examining the Use of Tweets Embedded in Online News Articles across Reliable and Fringe Outlets
Authors:
Benjamin D. Horne,
Summer Phillips,
Nelia Koontz
Abstract:
This study examines the use of embedded tweets in online news media. In particular, we add to the previous literature by exploring embedded tweets across reliable and unreliable news outlets. We use a mixed-method analysis to examine how the function and frequency of embedded tweets change across outlet reliability and news topic. We find that, no matter the outlet reliability, embedded tweets are most often used to relay the opinions of elites, to syndicate information from another news source, or to self-cite information an outlet previously produced. Our results also show some notable differences between reliable media and fringe media's use of tweets. Namely, fringe media embed tweets more and use those tweets as the source of news more than reliable media. Our work adds to the literature on hybrid media systems and the normalization of social media in journalism.
Submitted 29 January, 2024;
originally announced January 2024.
-
Is disruption decreasing, or is it accelerating?
Authors:
R. Alexander Bentley,
Sergi Valverde,
Joshua Borycz,
Blai Vidiella,
Benjamin D. Horne,
Salva Duran-Nebreda,
Michael J. O'Brien
Abstract:
A recent highly publicized study by Park et al. (Nature 613: 138-144, 2023), claiming that science has become less disruptive over recent decades, represents an extraordinary achievement but with deceptive results. The measure of disruption, CD-5, in this study does not account for differences in citation amid decades of exponential growth in publication rate. In order to account for both the exponential growth and the differential impact of research works over time, here we apply a weighted disruption index to the same dataset. We find that, among research papers in the dataset, this weighted disruption index has been close to its expected neutral value over the last fifty years and has even increased modestly since 2000. We also show how the proportional decrease in unique words (highlighted by Park et al. (2023)) is expected in an exponentially growing corpus. Finding little evidence for recent decrease in disruption, we suggest that it is actually increasing.
Submitted 25 June, 2023;
originally announced June 2023.
-
Examining the Production of Co-active Channels on YouTube and BitChute
Authors:
Matthew C. Childs,
Benjamin D. Horne
Abstract:
A concern among content moderation researchers is that hard moderation measures, such as banning content producers, will push users to more extreme information environments. Research in this area is still new, but predominantly focuses on one-way migration (from mainstream to alt-tech) due to this concern. However, content producers on alt-tech social media platforms are not always banned users from mainstream platforms; instead, they may be co-active across platforms. We explore co-activity on two such platforms: YouTube and BitChute. Specifically, we describe differences in video production across 27 co-active channels. We find that the majority of channels use significantly more moral and political words in their video titles on BitChute than in their video titles on YouTube. However, the reasoning for this shift seems to be different across channels. In some cases, we find that channels produce videos on different sets of topics across the platforms, often producing content on BitChute that would likely be moderated on YouTube. In rare cases, we find video titles of the same video change across the platforms. Overall, there is not a consistent trend across co-active channels in our sample, suggesting that the production on alt-tech social media platforms does not fit a single narrative.
Submitted 14 March, 2023;
originally announced March 2023.
-
A Psycho-linguistic Analysis of BitChute
Authors:
Benjamin D. Horne
Abstract:
In order to better support researchers, journalists, and practitioners in their use of the MeLa-BitChute dataset for exploration and investigative reporting, we provide new psycho-linguistic metadata for the videos, comments, and channels in the dataset using LIWC22. This paper describes that metadata and methods to filter the data using the metadata. In addition, we provide basic analysis and comparison of the language on BitChute to other social media platforms. The MeLa-BitChute dataset and LIWC metadata described in this paper can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KRD1VS.
Submitted 20 April, 2022; v1 submitted 17 April, 2022;
originally announced April 2022.
-
Characterizing YouTube and BitChute Content and Mobilizers During U.S. Election Fraud Discussions on Twitter
Authors:
Matthew C. Childs,
Cody Buntain,
Milo Z. Trujillo,
Benjamin D. Horne
Abstract:
In this study, we characterize the cross-platform mobilization of YouTube and BitChute videos on Twitter during the 2020 U.S. Election fraud discussions. Specifically, we extend the VoterFraud2020 dataset to describe the prevalence of content supplied by both platforms, the mobilizers of that content, the suppliers of that content, and the content itself. We find that while BitChute videos promoting election fraud claims were linked to and engaged with in the Twitter discussion, they played a relatively small role compared to YouTube videos promoting fraud claims. This core finding points to the continued need for proactive, consistent, and collaborative content moderation solutions rather than the reactive and inconsistent solutions currently being used. Additionally, we find that cross-platform disinformation spread from video platforms was not prominently from bot accounts or political elites, but rather average Twitter users. This finding supports past work arguing that research on disinformation should move beyond a focus on bots and trolls to a focus on participatory disinformation spread.
Submitted 30 March, 2022;
originally announced March 2022.
-
NELA-Local: A Dataset of U.S. Local News Articles for the Study of County-level News Ecosystems
Authors:
Benjamin D. Horne,
Maurício Gruppi,
Kenneth Joseph,
Jon Green,
John P. Wihbey,
Sibel Adalı
Abstract:
In this paper, we present a dataset of over 1.4M online news articles from 313 local U.S. news outlets published over 20 months (between April 4th, 2020 and December 31st, 2021). These outlets cover a geographically diverse set of communities across the United States. In order to estimate characteristics of the local audience, included with this news article data is a wide range of county-level metadata, including demographics, 2020 Presidential Election vote shares, and community resilience estimates from the U.S. Census Bureau. The NELA-Local dataset can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GFE66K.
Submitted 16 March, 2022;
originally announced March 2022.
-
NELA-GT-2022: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Authors:
Maurício Gruppi,
Benjamin D. Horne,
Sibel Adalı
Abstract:
In this paper, we present the fifth installment of the NELA-GT datasets, NELA-GT-2022. The dataset contains 1,778,361 articles from 361 outlets between January 1st, 2022 and December 31st, 2022. Just as in past releases of the dataset, NELA-GT-2022 includes outlet-level veracity labels from Media Bias/Fact Check and tweets embedded in collected news articles. The NELA-GT-2022 dataset can be found at: https://doi.org/10.7910/DVN/AMCV2H
Submitted 17 March, 2023; v1 submitted 10 March, 2022;
originally announced March 2022.
-
The MeLa BitChute Dataset
Authors:
Milo Trujillo,
Maurício Gruppi,
Cody Buntain,
Benjamin D. Horne
Abstract:
In this paper we present a near-complete dataset of over 3M videos from 61K channels over 2.5 years (June 2019 to December 2021) from the social video hosting platform BitChute, a commonly used alternative to YouTube. Additionally, we include a variety of video-level metadata, including comments, channel descriptions, and views for each video. The MeLa-BitChute dataset can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KRD1VS.
Submitted 10 February, 2022;
originally announced February 2022.
-
Local News Online and COVID in the U.S.: Relationships among Coverage, Cases, Deaths, and Audience
Authors:
Kenneth Joseph,
Benjamin D. Horne,
Jon Green,
John P. Wihbey
Abstract:
We present analyses from a real-time information monitoring system of online local news in the U.S. We study relationships among online local news coverage of COVID, cases and deaths in an area, and properties of local news outlets and their audiences. Our analysis relies on a unique dataset of the online content of over 300 local news outlets, encompassing over 750,000 articles over a period of 10 months spanning April 2020 to February 2021. We find that the rate of COVID coverage over time by local news outlets was primarily associated with death rates at the national level, but that this effect dissipated over the course of the pandemic as news about COVID was steadily displaced by sociopolitical events, like the 2020 U.S. elections. We also find that both the volume and content of COVID coverage differed depending on local politics, and outlet audience size, as well as evidence that more vulnerable populations received less pandemic-related news.
Submitted 16 November, 2021;
originally announced November 2021.
-
NELA-GT-2020: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Authors:
Maurício Gruppi,
Benjamin D. Horne,
Sibel Adalı
Abstract:
In this paper, we present an updated version of the NELA-GT-2019 dataset, entitled NELA-GT-2020. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data. The NELA-GT-2020 dataset can be found at https://doi.org/10.7910/DVN/CHMUYZ.
Submitted 8 February, 2021;
originally announced February 2021.
-
Tell Me Who Your Friends Are: Using Content Sharing Behavior for News Source Veracity Detection
Authors:
Maurício Gruppi,
Benjamin D. Horne,
Sibel Adalı
Abstract:
Stopping the malicious spread and production of false and misleading news has become a top priority for researchers. Due to this prevalence, many automated methods for detecting low quality information have been introduced. The majority of these methods have used article-level features, such as their writing style, to detect veracity. While writing style models have been shown to work well in lab settings, there are concerns of generalizability and robustness. In this paper, we begin to address these concerns by proposing a novel and robust news veracity detection model that uses the content sharing behavior of news sources formulated as a network. We represent these content sharing networks (CSN) using a deep walk based method for embedding graphs that accounts for similarity in both the network space and the article text space. We show that state-of-the-art writing style and CSN features make diverse mistakes when predicting, meaning that they both play different roles in the classification task. Moreover, we show that the addition of CSN features increases the accuracy of writing style models, boosting accuracy as much as 14% when using Random Forests. Similarly, we show that the combination of hand-crafted article-level features and CSN features is robust to concept drift, performing consistently well over a 10-month time frame.
Submitted 15 January, 2021;
originally announced January 2021.
-
Do All Good Actors Look The Same? Exploring News Veracity Detection Across The U.S. and The U.K
Authors:
Benjamin D. Horne,
Maurício Gruppi,
Sibel Adalı
Abstract:
A major concern with text-based news veracity detection methods is that they may not generalize across countries and cultures. In this short paper, we explicitly test news veracity models across news data from the United States and the United Kingdom, demonstrating there is reason for concern about generalizability. Through a series of testing scenarios, we show that text-based classifiers perform poorly when trained on one country's news data and tested on another. Furthermore, these same models have trouble classifying unseen, unreliable news sources. In conclusion, we discuss implications of these results and avenues for future work.
Submitted 26 May, 2020;
originally announced June 2020.
-
What is BitChute? Characterizing the "Free Speech" Alternative to YouTube
Authors:
Milo Trujillo,
Maurício Gruppi,
Cody Buntain,
Benjamin D. Horne
Abstract:
In this paper, we characterize the content and discourse on BitChute, a social video-hosting platform. Launched in 2017 as an alternative to YouTube, BitChute joins an ecosystem of alternative, low content moderation platforms, including Gab, Voat, Minds, and 4chan. Uniquely, BitChute is the first of these alternative platforms to focus on video content and is growing in popularity. Our analysis reveals several key characteristics of the platform. We find that only a handful of channels receive any engagement, and almost all of those channels contain conspiracies or hate speech. This high rate of hate speech on the platform as a whole, much of which is anti-Semitic, is particularly concerning. Our results suggest that BitChute has a higher rate of hate speech than Gab but less than 4chan. Lastly, we find that while some BitChute content producers have been banned from other platforms, many maintain profiles on mainstream social media platforms, particularly YouTube. This paper contributes a first look at the content and discourse on BitChute and provides a building block for future research on low content moderation platforms.
Submitted 29 May, 2020; v1 submitted 4 April, 2020;
originally announced April 2020.
-
NELA-GT-2019: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Authors:
Maurício Gruppi,
Benjamin D. Horne,
Sibel Adalı
Abstract:
In this paper, we present an updated version of the NELA-GT-2018 dataset (Nørregaard, Horne, and Adalı 2019), entitled NELA-GT-2019. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity. The NELA-GT-2019 dataset can be found at: https://doi.org/10.7910/DVN/O7FWPO
Submitted 26 March, 2020; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Trustworthy Misinformation Mitigation with Soft Information Nudging
Authors:
Benjamin D. Horne,
Maurício Gruppi,
Sibel Adalı
Abstract:
Research in combating misinformation reports many negative results: facts may not change minds, especially if they come from sources that are not trusted. Individuals can disregard and justify lies told by trusted sources. This problem is made even worse by social recommendation algorithms which help amplify conspiracy theories and information confirming one's own biases due to companies' efforts to optimize for clicks and watch time over individuals' own values and public good. As a result, more nuanced voices and facts are drowned out by a continuous erosion of trust in better information sources. Most misinformation mitigation techniques assume that discrediting, filtering, or demoting low veracity information will help news consumers make better information decisions. However, these negative results indicate that some news consumers, particularly extreme or conspiracy news consumers will not be helped.
We argue that, given this background, technology solutions to combating misinformation should not simply seek facts or discredit bad news sources, but instead use more subtle nudges towards better information consumption. Repeated exposure to such nudges can help promote trust in better information sources and also improve societal outcomes in the long run. In this article, we will talk about technological solutions that can help us in developing such an approach, and introduce one such model called Trust Nudging.
Submitted 13 November, 2019;
originally announced November 2019.
-
NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles
Authors:
Jeppe Norregaard,
Benjamin D. Horne,
Sibel Adali
Abstract:
In this paper, we present a dataset of 713k articles collected between 02/2018-11/2018. These articles are collected directly from 194 news and media outlets including mainstream, hyper-partisan, and conspiracy sources. We incorporate ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust. The NELA-GT-2018 dataset can be found at https://doi.org/10.7910/DVN/ULHLCB.
Submitted 2 April, 2019;
originally announced April 2019.
-
Different Spirals of Sameness: A Study of Content Sharing in Mainstream and Alternative Media
Authors:
Benjamin D. Horne,
Jeppe Norregaard,
Sibel Adali
Abstract:
In this paper, we analyze content sharing between news sources in the alternative and mainstream media using a dataset of 713K articles and 194 sources. We find that content sharing happens in tightly formed communities, and these communities represent relatively homogeneous portions of the media landscape. Through a mixed-method analysis, we find several primary content sharing behaviors. First, we find that the vast majority of shared articles are only shared with similar news sources (i.e., the same community). Second, we find that despite these echo chambers of sharing, specific sources, such as The Drudge Report, mix content from both mainstream and conspiracy communities. Third, we show that while these differing communities do not always share news articles, they do report on the same events, but often with competing and counter-narratives. Overall, we find that the news is homogeneous within communities and diverse in between, creating different spirals of sameness.
Submitted 2 April, 2019;
originally announced April 2019.
-
Rating Reliability and Bias in News Articles: Does AI Assistance Help Everyone?
Authors:
Benjamin D. Horne,
Dorit Nevo,
John O'Donovan,
Jin-Hee Cho,
Sibel Adali
Abstract:
With the spread of false and misleading information in current news, many algorithmic tools have been introduced with the aim of assessing bias and reliability in written content. However, there has been little work exploring how effective these tools are at changing human perceptions of content. To this end, we conduct a study with 654 participants to understand if algorithmic assistance improves the accuracy of reliability and bias perceptions, and whether there is a difference in the effectiveness of the AI assistance for different types of news consumers. We find that AI assistance with feature-based explanations improves the accuracy of news perceptions. However, some consumers are helped more than others. Specifically, we find that participants who read and share news often on social media are worse at recognizing bias and reliability issues in news articles than those who do not, while frequent news readers and those familiar with politics perform much better. We discuss these differences and their implications to offer insights for future research.
Submitted 16 May, 2019; v1 submitted 2 April, 2019;
originally announced April 2019.
-
Models for Predicting Community-Specific Interest in News Articles
Authors:
Benjamin D. Horne,
William Dron,
Sibel Adali
Abstract:
In this work, we ask two questions: 1. Can we predict the type of community interested in a news article using only features from the article content? and 2. How well do these models generalize over time? To answer these questions, we compute well-studied content-based features on over 60K news articles from 4 communities on reddit.com. We train and test models over three different time periods between 2015 and 2017 to demonstrate which features degrade in performance the most due to concept drift. Our models can classify news articles into communities with high accuracy, ranging from 0.81 ROC AUC to 1.0 ROC AUC. However, while we can predict the community-specific popularity of news articles with high accuracy, practitioners should approach these models carefully. Predictions are both community-pair dependent and feature group dependent. Moreover, these feature groups generalize over time differently, with some only degrading slightly over time, but others degrading greatly. Therefore, we recommend that community-interest predictions are done in a hierarchical structure, where multiple binary classifiers can be used to separate community pairs, rather than a traditional multi-class model. Second, these models should be retrained over time based on accuracy goals and the availability of training data.
Submitted 27 August, 2018;
originally announced August 2018.
-
An Exploration of Unreliable News Classification in Brazil and The U.S
Authors:
Mauricio Gruppi,
Benjamin D. Horne,
Sibel Adali
Abstract:
The propagation of unreliable information is on the rise in many places around the world. This expansion is facilitated by the rapid spread of information and the anonymity granted by the Internet. The spread of unreliable information is a well-studied issue and it is associated with negative social impacts. In previous work, we identified significant differences in the structure of news articles from reliable and unreliable sources in the US media. Our goal in this work was to explore such differences in the Brazilian media. We found significant features in two data sets: one with Brazilian news in Portuguese and another one with US news in English. Our results show that features related to the writing style were prominent in both data sets and, despite the language difference, some features have a universal behavior, being significant to both US and Brazilian news articles. Finally, we combined both data sets and used the universal features to build a machine learning classifier to predict the source type of a news article as reliable or unreliable.
Submitted 7 June, 2018;
originally announced June 2018.
-
An Exploration of Verbatim Content Republishing by News Producers
Authors:
Benjamin D. Horne,
Sibel Adali
Abstract:
In today's news ecosystem, news sources emerge frequently and can vary widely in intent. This intent can range from benign to malicious, with many tactics being used to achieve their goals. One lesser-studied tactic is content republishing, which can be used to make specific stories seem more important, create uncertainty around an event, or create a perception of credibility for unreliable news sources. In this paper, we take a first step in understanding this tactic by exploring verbatim content copying across 92 news producers of various characteristics. We find that content copying occurs more frequently between like-audience sources (e.g., alternative news, mainstream news), but sparse connections consistently exist between these communities. We also find that despite articles being verbatim, the headlines are often changed. Specifically, we find that mainstream sources change more structural features, while alternative sources change many more content features, often changing the emotional tone and bias of the titles. We conclude that content republishing networks can help identify and label the intent of brand-new news sources using the tight-knit community they belong to. In addition, it is possible to use the network to find important content producers in each community, producers that are used to amplify messages of other sources, and producers that distort the messages of other sources.
Submitted 15 May, 2018;
originally announced May 2018.
-
Sampling the News Producers: A Large News and Feature Data Set for the Study of the Complex Media Landscape
Authors:
Benjamin D. Horne,
William Dron,
Sara Khedr,
Sibel Adali
Abstract:
The complexity and diversity of today's media landscape provide many challenges for researchers studying news producers. These producers use many different strategies to get their message believed by readers through the writing styles they employ, by repetition across different media sources with or without attribution, as well as other mechanisms that are yet to be studied deeply. To better facilitate systematic studies in this area, we present a large political news data set, containing over 136K news articles, from 92 news sources, collected over 7 months of 2017. These news sources are carefully chosen to include well-established and mainstream sources, maliciously fake sources, satire sources, and hyper-partisan political blogs. In addition to each article, we compute 130 content-based and social media engagement features drawn from a wide range of literature on political bias, persuasion, and misinformation. With the release of the data set, we also provide the source code for feature computation. In this paper, we discuss the first release of the data set and demonstrate 4 use cases of the data and features: news characterization, engagement characterization, news attribution and content copying, and discovering news narratives.
Submitted 16 August, 2018; v1 submitted 27 March, 2018;
originally announced March 2018.
-
Identifying the social signals that drive online discussions: A case study of Reddit communities
Authors:
Benjamin D. Horne,
Sibel Adali,
Sujoy Sikdar
Abstract:
Increasingly, people form opinions based on information they consume on online social media. As a result, it is crucial to understand what type of content attracts people's attention on social media and drives discussions. In this paper we focus on online discussions. Can we predict which comments and what content get the highest attention in an online discussion? How does this content differ from community to community? To accomplish this, we undertake a unique study of Reddit involving a large sample of comments from 11 popular subreddits with different properties. We introduce a large number of sentiment, relevance, and content analysis features, including some novel features customized to Reddit. Through a comparative analysis of the chosen subreddits, we show that our models are correctly able to retrieve top replies under a post with great precision. In addition, we explain our findings with a detailed analysis of what distinguishes high-scoring posts in different communities that differ along the dimensions of the specificity of topic and style, audience, and level of moderation.
Submitted 7 May, 2017;
originally announced May 2017.
-
The Impact of Crowds on News Engagement: A Reddit Case Study
Authors:
Benjamin D. Horne,
Sibel Adali
Abstract:
Today, users are reading the news through social platforms. These platforms are built to facilitate crowd engagement, but not necessarily to disseminate useful news to inform the masses. Hence, the news that is highly engaged with may not be the news that best informs. While predicting news popularity has been well studied, it has not been studied in the context of crowd manipulations. In this paper, we provide some preliminary results from a longer-term project on crowd and platform manipulations of news and news popularity. In particular, we choose to study known features for predicting news popularity and how those features may change on reddit.com, a social platform used commonly for news aggregation. Along with this, we explore ways in which users can alter the perception of news through changing the title of an article. We find that news on reddit is predictable using previously studied sentiment and content features and that posts with titles changed by reddit users tend to be more popular than posts with the original article title.
Submitted 3 November, 2017; v1 submitted 30 March, 2017;
originally announced March 2017.
-
This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News
Authors:
Benjamin D. Horne,
Sibel Adali
Abstract:
The problem of fake news has gained a lot of attention as it is claimed to have had a significant impact on the 2016 US Presidential Elections. Fake news is not a new problem and its spread in social networks is well-studied. Often an underlying assumption in fake news discussion is that it is written to look like real news, fooling the reader who does not check for reliability of the sources or the arguments in its content. Through a unique study of three data sets and features that capture the style and the language of articles, we show that this assumption is not true. Fake news in most cases is more similar to satire than to real news, leading us to conclude that persuasion in fake news is achieved through heuristics rather than the strength of arguments. We show that overall title structure and the use of proper nouns in titles are very significant in differentiating fake from real. This leads us to conclude that fake news is targeted at audiences who are not likely to read beyond titles and is aimed at creating mental associations between entities and claims.
Submitted 28 March, 2017;
originally announced March 2017.