
SciLander: Mapping the Scientific News Landscape

Authors: Maurício Gruppi, Panayiotis Smeros, Sibel Adalı, Carlos Castillo, Karl Aberer

Abstract: The COVID-19 pandemic has fueled the spread of misinformation on social media and the Web as a whole. The phenomenon dubbed "infodemic" has taken the challenges of information veracity and trust to new heights by massively introducing seemingly scientific and technical elements into misleading content. Despite the existing body of work on modeling and predicting misinformation, the coverage of very complex scientific topics with inherent uncertainty and an evolving set of findings, such as COVID-19, provides many new challenges that are not easily solved by existing tools. To address these issues, we introduce SciLander, a method for learning representations of news sources reporting on science-based topics. SciLander extracts four heterogeneous indicators for the news sources; two generic indicators that capture (1) the copying of news stories between sources, and (2) the use of the same terms to mean different things (i.e., the semantic shift of terms), and two scientific indicators that capture (1) the usage of jargon and (2) the stance towards specific citations. We use these indicators as signals of source agreement, sampling pairs of positive (similar) and negative (dissimilar) samples, and combine them in a unified framework to train unsupervised news source embeddings with a triplet margin loss objective. We evaluate our method on a novel COVID-19 dataset containing nearly 1M news articles from 500 sources spanning a period of 18 months since the beginning of the pandemic in 2020. Our results show that the features learned by our model outperform state-of-the-art baseline methods on the task of news veracity classification. Furthermore, a clustering analysis suggests that the learned representations encode information about the reliability, political leaning, and partisanship bias of these sources.

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2203.08600 [pdf, other]

NELA-Local: A Dataset of U.S. Local News Articles for the Study of County-level News Ecosystems

Authors: Benjamin D. Horne, Maurício Gruppi, Kenneth Joseph, Jon Green, John P. Wihbey, Sibel Adalı

Abstract: In this paper, we present a dataset of over 1.4M online news articles from 313 local U.S. news outlets published over 20 months (between April 4th, 2020 and December 31st, 2021). These outlets cover a geographically diverse set of communities across the United States. In order to estimate characteristics of the local audience, included with this news article data is a wide range of county-level metadata, including demographics, 2020 Presidential Election vote shares, and community resilience estimates from the U.S. Census Bureau. The NELA-Local dataset can be found at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/GFE66K.

Submitted 16 March, 2022; originally announced March 2022.

Comments: Published at ICWSM 2022

arXiv:2203.05659 [pdf, other]

NELA-GT-2022: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı

Abstract: In this paper, we present the fifth installment of the NELA-GT datasets, NELA-GT-2022. The dataset contains 1,778,361 articles from 361 outlets between January 1st, 2022 and December 31st, 2022. Just as in past releases of the dataset, NELA-GT-2022 includes outlet-level veracity labels from Media Bias/Fact Check and tweets embedded in collected news articles. The NELA-GT-2022 dataset can be found at: https://doi.org/10.7910/DVN/AMCV2H

Submitted 17 March, 2023; v1 submitted 10 March, 2022; originally announced March 2022.

Comments: Technical report documenting the NELA-GT recent update (NELA-GT-2022). arXiv admin note: substantial text overlap with arXiv:2102.04567

arXiv:2102.04567 [pdf, other]

NELA-GT-2020: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı

Abstract: In this paper, we present an updated version of the NELA-GT-2019 dataset, entitled NELA-GT-2020. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data. The NELA-GT-2020 dataset can be found at https://doi.org/10.7910/DVN/CHMUYZ.

Submitted 8 February, 2021; originally announced February 2021.

Comments: 6 pages, 4 figures. arXiv admin note: text overlap with arXiv:2003.08444

arXiv:2102.00290 [pdf, other]

Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks

Authors: Maurício Gruppi, Sibel Adalı, Pin-Yu Chen

Abstract: The use of language is subject to variation over time as well as across social groups and knowledge domains, leading to differences even in the monolingual scenario. Such variation in word usage is often called lexical semantic change (LSC). The goal of LSC is to characterize and quantify language variations with respect to word meaning, to measure how distinct two language sources are (that is, people or language models). Because there is hardly any data available for such a task, most solutions involve unsupervised methods to align two embeddings and predict semantic change with respect to a distance measure. To that end, we propose a self-supervised approach to model lexical semantic change by generating training samples by introducing perturbations of word vectors in the input corpora. We show that our method can be used for the detection of semantic change with any alignment method. Furthermore, it can be used to choose the landmark words to use in alignment and can lead to substantial improvements over the existing techniques for alignment. We illustrate the utility of our techniques using experimental results on three different datasets, involving words with the same or different meanings. Our methods not only provide significant improvements but also can lead to novel findings for the LSC problem.

Submitted 30 January, 2021; originally announced February 2021.

Comments: Published at AAAI-2021

arXiv:2101.10973 [pdf, other]

Tell Me Who Your Friends Are: Using Content Sharing Behavior for News Source Veracity Detection

Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı

Abstract: Stopping the malicious spread and production of false and misleading news has become a top priority for researchers. Due to this prevalence, many automated methods for detecting low quality information have been introduced. The majority of these methods have used article-level features, such as their writing style, to detect veracity. While writing style models have been shown to work well in lab settings, there are concerns of generalizability and robustness. In this paper, we begin to address these concerns by proposing a novel and robust news veracity detection model that uses the content sharing behavior of news sources formulated as a network. We represent these content sharing networks (CSN) using a deep walk based method for embedding graphs that accounts for similarity in both the network space and the article text space. We show that state of the art writing style and CSN features make diverse mistakes when predicting, meaning that they both play different roles in the classification task. Moreover, we show that the addition of CSN features increases the accuracy of writing style models, boosting accuracy as much as 14% when using Random Forests. Similarly, we show that the combination of hand-crafted article-level features and CSN features is robust to concept drift, performing consistently well over a 10-month time frame.

Submitted 15 January, 2021; originally announced January 2021.

Comments: Preprint Version

arXiv:2012.01603 [pdf, other]

SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change

Authors: Maurício Gruppi, Sibel Adali, Pin-Yu Chen

Abstract: This paper describes SChME (Semantic Change Detection with Model Ensemble), a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change. SChME uses a model ensemble combining signals of distributional models (word embeddings) and word frequency models where each model casts a vote indicating the probability that a word suffered semantic change according to that feature. More specifically, we combine cosine distance of word vectors with a neighborhood-based metric we named Mapped Neighborhood Distance (MAP), and a word frequency differential metric as input signals to our model. Additionally, we explore alignment-based methods to investigate the importance of the landmarks used in this process. Our results show evidence that the number of landmarks used for alignment has a direct impact on the predictive performance of the model. Moreover, we show that languages that suffer less semantic change tend to benefit from using a large number of landmarks, whereas languages with more semantic change benefit from a more careful choice of landmark number for alignment.

Submitted 2 December, 2020; originally announced December 2020.

arXiv:2006.01211 [pdf, other]

Do All Good Actors Look The Same? Exploring News Veracity Detection Across The U.S. and The U.K

Authors: Benjamin D. Horne, Maurício Gruppi, Sibel Adalı

Abstract: A major concern with text-based news veracity detection methods is that they may not generalize across countries and cultures. In this short paper, we explicitly test news veracity models across news data from the United States and the United Kingdom, demonstrating there is reason for concern about generalizability. Through a series of testing scenarios, we show that text-based classifiers perform poorly when trained on one country's news data and tested on another. Furthermore, these same models have trouble classifying unseen, unreliable news sources. In conclusion, we discuss implications of these results and avenues for future work.

Submitted 26 May, 2020; originally announced June 2020.

Comments: Published in ICWSM 2020 Data Challenge

arXiv:2005.00596 [pdf, other]

Learning from Noisy Labels with Noise Modeling Network

Authors: Zhuolin Jiang, Jan Silovsky, Man-Hung Siu, William Hartmann, Herbert Gish, Sancar Adali

Abstract: Multi-label image classification has generated significant interest in recent years and the performance of such systems often suffers from the not so infrequent occurrence of incorrect or missing labels in the training data. In this paper, we extend the state-of-the-art of training classifiers to jointly deal with both forms of errorful data. We accomplish this by modeling noisy and missing labels in multi-label images with a new Noise Modeling Network (NMN) that follows our convolutional neural network (CNN) and integrates with it, forming an end-to-end deep learning system that can jointly learn the noise distribution and CNN parameters. The NMN learns the distribution of noise patterns directly from the noisy data without the need for any clean training data. The NMN can model label noise that depends only on the true label or is also dependent on the image features. We show that the integrated NMN/CNN learning system consistently improves the classification performance, for different levels of label noise, on the MSR-COCO dataset and MSR-VTT dataset. We also show that noise performance improvements are obtained when multiple instance learning methods are used.

Submitted 1 May, 2020; originally announced May 2020.

arXiv:2003.08444 [pdf, other]

NELA-GT-2019: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Authors: Maurício Gruppi, Benjamin D. Horne, Sibel Adalı

Abstract: In this paper, we present an updated version of the NELA-GT-2018 dataset (Nørregaard, Horne, and Adalı 2019), entitled NELA-GT-2019. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity. The NELA-GT-2019 dataset can be found at: https://doi.org/10.7910/DVN/O7FWPO

Submitted 26 March, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: Updated dataset for paper NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles, originally published at ICWSM in 2019

arXiv:1911.05825 [pdf, other]

Trustworthy Misinformation Mitigation with Soft Information Nudging

Authors: Benjamin D. Horne, Maurício Gruppi, Sibel Adalı

Abstract: Research in combating misinformation reports many negative results: facts may not change minds, especially if they come from sources that are not trusted. Individuals can disregard and justify lies told by trusted sources. This problem is made even worse by social recommendation algorithms which help amplify conspiracy theories and information confirming one's own biases due to companies' efforts to optimize for clicks and watch time over individuals' own values and public good. As a result, more nuanced voices and facts are drowned out by a continuous erosion of trust in better information sources. Most misinformation mitigation techniques assume that discrediting, filtering, or demoting low veracity information will help news consumers make better information decisions. However, these negative results indicate that some news consumers, particularly extreme or conspiracy news consumers, will not be helped. We argue that, given this background, technology solutions to combating misinformation should not simply seek facts or discredit bad news sources, but instead use more subtle nudges towards better information consumption. Repeated exposure to such nudges can help promote trust in better information sources and also improve societal outcomes in the long run. In this article, we will talk about technological solutions that can help us in developing such an approach, and introduce one such model called Trust Nudging.

Submitted 13 November, 2019; originally announced November 2019.

Comments: Published at IEEE TPS 2019

arXiv:1904.01546 [pdf, other]

NELA-GT-2018: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles

Authors: Jeppe Norregaard, Benjamin D. Horne, Sibel Adali

Abstract: In this paper, we present a dataset of 713k articles collected between 02/2018-11/2018. These articles are collected directly from 194 news and media outlets including mainstream, hyper-partisan, and conspiracy sources. We incorporate ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust. The NELA-GT-2018 dataset can be found at https://doi.org/10.7910/DVN/ULHLCB.

Submitted 2 April, 2019; originally announced April 2019.

Comments: Published at ICWSM 2019

arXiv:1904.01534 [pdf, other]

Different Spirals of Sameness: A Study of Content Sharing in Mainstream and Alternative Media

Authors: Benjamin D. Horne, Jeppe Norregaard, Sibel Adali

Abstract: In this paper, we analyze content sharing between news sources in the alternative and mainstream media using a dataset of 713K articles and 194 sources. We find that content sharing happens in tightly formed communities, and these communities represent relatively homogeneous portions of the media landscape. Through a mixed-method analysis, we find several primary content sharing behaviors. First, we find that the vast majority of shared articles are only shared with similar news sources (i.e., same community). Second, we find that despite these echo-chambers of sharing, specific sources, such as The Drudge Report, mix content from both mainstream and conspiracy communities. Third, we show that while these differing communities do not always share news articles, they do report on the same events, but often with competing and counter-narratives. Overall, we find that the news is homogeneous within communities and diverse in between, creating different spirals of sameness.

Submitted 2 April, 2019; originally announced April 2019.

Comments: Published at ICWSM 2019

arXiv:1904.01531 [pdf, other]

Rating Reliability and Bias in News Articles: Does AI Assistance Help Everyone?

Authors: Benjamin D. Horne, Dorit Nevo, John O'Donovan, Jin-Hee Cho, Sibel Adali

Abstract: With the spread of false and misleading information in current news, many algorithmic tools have been introduced with the aim of assessing bias and reliability in written content. However, there has been little work exploring how effective these tools are at changing human perceptions of content. To this end, we conduct a study with 654 participants to understand if algorithmic assistance improves the accuracy of reliability and bias perceptions, and whether there is a difference in the effectiveness of the AI assistance for different types of news consumers. We find that AI assistance with feature-based explanations improves the accuracy of news perceptions. However, some consumers are helped more than others. Specifically, we find that participants who read and share news often on social media are worse at recognizing bias and reliability issues in news articles than those who do not, while frequent news readers and those familiar with politics perform much better. We discuss these differences and their implication to offer insights for future research.

Submitted 16 May, 2019; v1 submitted 2 April, 2019; originally announced April 2019.

Comments: Published at ICWSM 2019

arXiv:1808.09270 [pdf, other]

Models for Predicting Community-Specific Interest in News Articles

Authors: Benjamin D. Horne, William Dron, Sibel Adali

Abstract: In this work, we ask two questions: 1. Can we predict the type of community interested in a news article using only features from the article content? and 2. How well do these models generalize over time? To answer these questions, we compute well-studied content-based features on over 60K news articles from 4 communities on reddit.com. We train and test models over three different time periods between 2015 and 2017 to demonstrate which features degrade in performance the most due to concept drift. Our models can classify news articles into communities with high accuracy, ranging from 0.81 ROC AUC to 1.0 ROC AUC. However, while we can predict the community-specific popularity of news articles with high accuracy, practitioners should approach these models carefully. Predictions are both community-pair dependent and feature group dependent. Moreover, these feature groups generalize over time differently, with some only degrading slightly over time, but others degrading greatly. Therefore, we recommend that community-interest predictions are done in a hierarchical structure, where multiple binary classifiers can be used to separate community pairs, rather than a traditional multi-class model. Second, these models should be retrained over time based on accuracy goals and the availability of training data.

Submitted 27 August, 2018; originally announced August 2018.

Comments: Published at IEEE MILCOM 2018 in Los Angeles, CA, USA

arXiv:1807.06519 [pdf, other]

Is Uncertainty Always Bad?: Effect of Topic Competence on Uncertain Opinions

Authors: Jin-Hee Cho, Sibel Adalı

Abstract: The proliferation of information disseminated by public/social media has made decision-making highly challenging due to the wide availability of noisy, uncertain, or unverified information. Although the issue of uncertainty in information has been studied for several decades, little work has investigated how noisy (or uncertain) or valuable (or credible) information can be formulated into people's opinions, modeling uncertainty both in the quantity and quality of evidence leading to a specific opinion. In this work, we model and analyze an opinion and information model by using Subjective Logic where the initial set of evidence is mixed with different types of evidence (i.e., pro vs. con or noisy vs. valuable) which is incorporated into the opinions of original propagators, who propagate information over a network. With the help of an extensive simulation study, we examine how the different ratios of information types or agents' prior belief or topic competence affect the overall information diffusion. Based on our findings, agents' high uncertainty is not necessarily always bad in making the right decision as long as they are competent enough not to be at least biased towards false information (e.g., neutral between two extremes).

Submitted 17 July, 2018; originally announced July 2018.

Journal ref: IEEE ICC 2018

arXiv:1806.02875 [pdf, ps, other]

An Exploration of Unreliable News Classification in Brazil and The U.S

Authors: Mauricio Gruppi, Benjamin D. Horne, Sibel Adali

Abstract: The propagation of unreliable information is on the rise in many places around the world. This expansion is facilitated by the rapid spread of information and anonymity granted by the Internet. The spread of unreliable information is a well-studied issue and it is associated with negative social impacts. In a previous work, we have identified significant differences in the structure of news articles from reliable and unreliable sources in the US media. Our goal in this work was to explore such differences in the Brazilian media. We found significant features in two data sets: one with Brazilian news in Portuguese and another one with US news in English. Our results show that features related to the writing style were prominent in both data sets and, despite the language difference, some features have a universal behavior, being significant to both US and Brazilian news articles. Finally, we combined both data sets and used the universal features to build a machine learning classifier to predict the source type of a news article as reliable or unreliable.

Submitted 7 June, 2018; originally announced June 2018.

Comments: Presented and Peer-Reviewed at NECO 2018

arXiv:1805.05939 [pdf, other]

An Exploration of Verbatim Content Republishing by News Producers

Authors: Benjamin D. Horne, Sibel Adali

Abstract: In today's news ecosystem, news sources emerge frequently and can vary widely in intent. This intent can range from benign to malicious, with many tactics being used to achieve their goals. One lesser studied tactic is content republishing, which can be used to make specific stories seem more important, create uncertainty around an event, or create a perception of credibility for unreliable news sources. In this paper, we take a first step in understanding this tactic by exploring verbatim content copying across 92 news producers of various characteristics. We find that content copying occurs more frequently between like-audience sources (e.g., alternative news, mainstream news, etc.), but sparse connections consistently exist between these communities. We also find that despite articles being verbatim, the headlines are often changed. Specifically, we find that mainstream sources change more structural features, while alternative sources change many more content features, often changing the emotional tone and bias of the titles. We conclude that content republishing networks can help identify and label the intent of brand-new news sources using the tight-knit community they belong to. In addition, it is possible to use the network to find important content producers in each community, producers that are used to amplify messages of other sources, and producers that distort the messages of other sources.

Submitted 15 May, 2018; originally announced May 2018.

Comments: Peer-reviewed by NECO 2018 Workshop

arXiv:1803.10124 [pdf, other]

Sampling the News Producers: A Large News and Feature Data Set for the Study of the Complex Media Landscape

Authors: Benjamin D. Horne, William Dron, Sara Khedr, Sibel Adali

Abstract: The complexity and diversity of today's media landscape provide many challenges for researchers studying news producers. These producers use many different strategies to get their message believed by readers through the writing styles they employ, by repetition across different media sources with or without attribution, as well as other mechanisms that are yet to be studied deeply. To better facilitate systematic studies in this area, we present a large political news data set, containing over 136K news articles, from 92 news sources, collected over 7 months of 2017. These news sources are carefully chosen to include well-established and mainstream sources, maliciously fake sources, satire sources, and hyper-partisan political blogs. In addition to each article we compute 130 content-based and social media engagement features drawn from a wide range of literature on political bias, persuasion, and misinformation. With the release of the data set, we also provide the source code for feature computation. In this paper, we discuss the first release of the data set and demonstrate 4 use cases of the data and features: news characterization, engagement characterization, news attribution and content copying, and discovering news narratives.

Submitted 16 August, 2018; v1 submitted 27 March, 2018; originally announced March 2018.

Comments: Published at ICWSM 2018. Dataset: https://github.com/BenjaminDHorne/NELA2017-Dataset-v1 Feature Code: https://github.com/BenjaminDHorne/Language-Features-for-News

arXiv:1706.03364 [pdf, ps, other]

Singularities of Restriction Varieties in $OG(k, n)$

Authors: Seçkin Adalı

Abstract: Restriction varieties in the orthogonal Grassmannian are subvarieties of $OG(k, n)$ defined by rank conditions given by a flag that is not necessarily isotropic with respect to the relevant symmetric bilinear form. In particular, Schubert varieties of Type B and D are examples of restriction varieties. In this paper, we introduce a resolution of singularities for restriction varieties in $OG(k, n)$, and give a description of their singular locus by studying components of the exceptional locus of the resolution.

Submitted 28 July, 2017; v1 submitted 11 June, 2017; originally announced June 2017.

Comments: Revised according to referee's comments

MSC Class: 14M15; 14E15; 32M10

arXiv:1705.06709 [pdf, other]

Learning Spatiotemporal Features for Infrared Action Recognition with 3D Convolutional Neural Networks

Authors: Zhuolin Jiang, Viktor Rozgic, Sancar Adali

Abstract: Infrared (IR) imaging has the potential to enable more robust action recognition systems compared to visible spectrum cameras due to lower sensitivity to lighting conditions and appearance variability. While the action recognition task on videos collected from visible spectrum imaging has received much attention, action recognition in IR videos is significantly less explored. Our objective is to exploit imaging data in this modality for the action recognition task. In this work, we propose a novel two-stream 3D convolutional neural network (CNN) architecture by introducing the discriminative code layer and the corresponding discriminative code loss function. The proposed network processes IR image and the IR-based optical flow field sequences. We pretrain the 3D CNN model on the visible spectrum Sports-1M action dataset and finetune it on the Infrared Action Recognition (InfAR) dataset. To the best of our knowledge, this is the first application of the 3D CNN to action recognition in the IR domain. We conduct an elaborate analysis of different fusion schemes (weighted average, single and double-layer neural nets) applied to different 3D CNN outputs. Experimental results demonstrate that our approach can achieve state-of-the-art average precision (AP) performances on the InfAR dataset: (1) the proposed two-stream 3D CNN achieves the best reported 77.5% AP, and (2) our 3D CNN model applied to the optical flow fields achieves the best reported single stream 75.42% AP.

Submitted 18 May, 2017; originally announced May 2017.

arXiv:1705.02673 [pdf, other]

Identifying the social signals that drive online discussions: A case study of Reddit communities

Authors: Benjamin D. Horne, Sibel Adali, Sujoy Sikdar

Abstract: Increasingly people form opinions based on information they consume on online social media. As a result, it is crucial to understand what type of content attracts people's attention on social media and drives discussions. In this paper we focus on online discussions. Can we predict which comments and what content gets the highest attention in an online discussion? How does this content differ from community to community? To accomplish this, we undertake a unique study of Reddit involving a large sample of comments from 11 popular subreddits with different properties. We introduce a large number of sentiment, relevance, content analysis features including some novel features customized to Reddit. Through a comparative analysis of the chosen subreddits, we show that our models are correctly able to retrieve top replies under a post with great precision. In addition, we explain our findings with a detailed analysis of what distinguishes high scoring posts in different communities that differ along the dimensions of the specificity of topic and style, audience and level of moderation.

Submitted 7 May, 2017; originally announced May 2017.

Comments: © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:1703.10570 [pdf, other]

The Impact of Crowds on News Engagement: A Reddit Case Study

Authors: Benjamin D. Horne, Sibel Adali

Abstract: Today, users are reading the news through social platforms. These platforms are built to facilitate crowd engagement, but not necessarily disseminate useful news to inform the masses. Hence, the news that is highly engaged with may not be the news that best informs. While predicting news popularity has been well studied, it has not been studied in the context of crowd manipulations. In this paper, we provide some preliminary results to a longer term project on crowd and platform manipulations of news and news popularity. In particular, we choose to study known features for predicting news popularity and how those features may change on reddit.com, a social platform used commonly for news aggregation. Along with this, we explore ways in which users can alter the perception of news through changing the title of an article. We find that news on reddit is predictable using previously studied sentiment and content features and that posts with titles changed by reddit users tend to be more popular than posts with the original article title.

Submitted 3 November, 2017; v1 submitted 30 March, 2017; originally announced March 2017.

Comments: Published at The 2nd International Workshop on News and Public Opinion at ICWSM 2017

arXiv:1703.09398 [pdf, other]

This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News

Authors: Benjamin D. Horne, Sibel Adali

Abstract: The problem of fake news has gained a lot of attention as it is claimed to have had a significant impact on the 2016 US Presidential Elections. Fake news is not a new problem and its spread in social networks is well-studied. Often an underlying assumption in fake news discussion is that it is written to look like real news, fooling the reader who does not check for reliability of the sources or the arguments in its content. Through a unique study of three data sets and features that capture the style and the language of articles, we show that this assumption is not true. Fake news in most cases is more similar to satire than to real news, leading us to conclude that persuasion in fake news is achieved through heuristics rather than the strength of arguments. We show overall title structure and the use of proper nouns in titles are very significant in differentiating fake from real. This leads us to conclude that fake news is targeted for audiences who are not likely to read beyond titles and is aimed at creating mental associations between entities and claims.

Submitted 28 March, 2017; originally announced March 2017.

Comments: Published at The 2nd International Workshop on News and Public Opinion at ICWSM

arXiv:1611.07636 [pdf, other]

Mechanism Design for Multi-Type Housing Markets

Authors: Sibel Adali, Sujoy Sikdar, Lirong Xia

Abstract: We study multi-type housing markets, where there are $p\ge 2$ types of items, each agent is initially endowed one item of each type, and the goal is to design mechanisms without monetary transfer to (re)allocate items to the agents based on their preferences over bundles of items, such that each agent gets one item of each type. In sharp contrast to classical housing markets, previous studies in multi-type housing markets have been hindered by the lack of natural solution concepts, because the strict core might be empty. We break the barrier in the literature by leveraging AI techniques and making natural assumptions on agents' preferences. We show that when agents' preferences are lexicographic, even with different importance orders, the classical top-trading-cycles mechanism can be extended while preserving most of its nice properties. We also investigate computational complexity of checking whether an allocation is in the strict core and checking whether the strict core is empty. Our results convey an encouragingly positive message: it is possible to design good mechanisms for multi-type housing markets under natural assumptions on preferences.

Submitted 22 November, 2016; originally announced November 2016.

Comments: full version of the AAAI-17 paper

arXiv:1401.3813 [pdf, other]

Seeded Graph Matching Via Joint Optimization of Fidelity and Commensurability

Authors: Heather Patsolic, Sancar Adali, Joshua T. Vogelstein, Youngser Park, Carey E. Priebe, Gongkai Li, Vince Lyzinski

Abstract: We present a novel approximate graph matching algorithm that incorporates seeded data into the graph matching paradigm. Our Joint Optimization of Fidelity and Commensurability (JOFC) algorithm embeds two graphs into a common Euclidean space where the matching inference task can be performed. Through real and simulated data examples, we demonstrate the versatility of our algorithm in matching graphs with various characteristics--weightedness, directedness, loopiness, many-to-one and many-to-many matchings, and soft seedings.

Submitted 8 December, 2019; v1 submitted 15 January, 2014; originally announced January 2014.

Comments: 26 pages, 7 figures. Updated content and added application of simultaneous matching for several time-steps for zebrafish connectomes

arXiv:1306.1977 [pdf, other]

Fidelity-Commensurability Tradeoff in Joint Embedding of Disparate Dissimilarities

Authors: Sancar Adali, Carey E. Priebe

Abstract: In various data settings, it is necessary to compare observations from disparate data sources. We assume the data is in the dissimilarity representation and investigate a joint embedding method that results in a commensurate representation of disparate dissimilarities. We further assume that there are "matched" observations from different conditions which can be considered to be highly similar, for the sake of inference. The joint embedding results in the joint optimization of fidelity (preservation of within-condition dissimilarities) and commensurability (preservation of between-condition dissimilarities between matched observations). We show that the tradeoff between these two criteria can be made explicit using weighted raw stress as the objective function for multidimensional scaling. In our investigations, we use a weight parameter, $w$, to control the tradeoff, and choose match detection as the inference task. Our results show weights that are optimal (with respect to the inference task) are different than equal weights for commensurability and fidelity and the proposed weighted embedding scheme provides significant improvements in statistical power.

Submitted 4 January, 2016; v1 submitted 8 June, 2013; originally announced June 2013.

arXiv:1209.0367 [pdf, other]

Seeded Graph Matching

Authors: Donniell E. Fishkind, Sancar Adali, Heather G. Patsolic, Lingyao Meng, Digvijay Singh, Vince Lyzinski, Carey E. Priebe

Abstract: Given two graphs, the graph matching problem is to align the two vertex sets so as to minimize the number of adjacency disagreements between the two graphs. The seeded graph matching problem is the graph matching problem when we are first given a partial alignment that we are tasked with completing. In this paper, we modify the state-of-the-art approximate graph matching algorithm "FAQ" of Vogelstein et al. (2015) to make it a fast approximate seeded graph matching algorithm, adapt its applicability to include graphs with differently sized vertex sets, and extend the algorithm so as to provide, for each individual vertex, a nomination list of likely matches. We demonstrate the effectiveness of our algorithm via simulation and real data experiments; indeed, knowledge of even a few seeds can be extremely effective when our seeded graph matching algorithm is used to recover a naturally existing alignment that is only partially observed.

Submitted 10 April, 2018; v1 submitted 3 September, 2012; originally announced September 2012.

Comments: 24 pages, 10 figures

arXiv:1112.5510 [pdf, other]

Manifold Matching: Joint Optimization of Fidelity and Commensurability

Authors: Carey E. Priebe, David J. Marchette, Zhiliang Ma, Sancar Adali

Abstract: Fusion and inference from multiple and massive disparate data sources - the requirement for our most challenging data analysis problems and the goal of our most ambitious statistical pattern recognition methodologies - has many and varied aspects which are currently the target of intense research and development. One aspect of the overall challenge is manifold matching - identifying embeddings of multiple disparate data spaces into the same low-dimensional space where joint inference can be pursued. We investigate this manifold matching task from the perspective of jointly optimizing the fidelity of the embeddings and their commensurability with one another, with a specific statistical inference exploitation task in mind. Our results demonstrate when and why our joint optimization methodology is superior to either version of separate optimization. The methodology is illustrated with simulations and an application in document matching.

Submitted 22 December, 2011; originally announced December 2011.

Comments: 22 pages, 12 figures

arXiv:1103.1359 [pdf, ps, other]

An Analysis of Optimal Link Bombs

Authors: Sibel Adali, Tina Liu, Malik Magdon-Ismail

Abstract: We analyze the phenomenon of collusion for the purpose of boosting the pagerank of a node in an interlinked environment. We investigate the optimal attack pattern for a group of nodes (attackers) attempting to improve the ranking of a specific node (the victim). We consider attacks where the attackers can only manipulate their own outgoing links. We show that the optimal attacks in this scenario are uncoordinated, i.e., the attackers link directly to the victim and to no one else, and do not link to each other. We also discuss optimal attack patterns for a group that wants to hide itself by not pointing directly to the victim. In these disguised attacks, the attackers link to nodes $l$ hops away from the victim. We show that an optimal disguised attack exists and how it can be computed. The optimal disguised attack also allows us to find optimal link farm configurations. A link farm can be considered a special case of our approach: the target page of the link farm is the victim and the other nodes in the link farm are the attackers for the purpose of improving the rank of the victim. The target page can however control its own outgoing links for the purpose of improving its own rank, which can be modeled as an optimal disguised attack of 1-hop on itself. Our results are unique in the literature as we show optimality not only in the pagerank score, but also in the rank based on the pagerank score. We further validate our results with experiments on a variety of random graph models.

Submitted 7 March, 2011; originally announced March 2011.

Comments: Full Version of a version which appeared in AIRweb 2005

Showing 1–30 of 30 results for author: Adali, S