-
Hostile Counterspeech Drives Users From Hate Subreddits
Authors:
Daniel Hickey,
Matheus Schmitz,
Daniel M. T. Fessler,
Paul E. Smaldino,
Kristina Lerman,
Goran Murić,
Keith Burghardt
Abstract:
Counterspeech -- speech that opposes hate speech -- has gained significant attention recently as a strategy to reduce hate on social media. While previous studies suggest that counterspeech can somewhat reduce hate speech, little is known about its effects on participation in online hate communities, nor which counterspeech tactics reduce harmful behavior. We begin to address these gaps by identifying 25 large hate communities ("subreddits") within Reddit and analyzing the effect of counterspeech on newcomers within these communities. We first construct a new public dataset of carefully annotated counterspeech and non-counterspeech comments within these subreddits. We use this dataset to train a state-of-the-art counterspeech detection model. Next, we use matching to evaluate the causal effects of hostile and non-hostile counterspeech on the engagement of newcomers in hate subreddits. We find that, while non-hostile counterspeech is ineffective at keeping users from fully disengaging from these hate subreddits, a single hostile counterspeech comment substantially reduces the future likelihood of engagement. While offering nuance to the understanding of counterspeech efficacy, these results a) leave unanswered the question of whether hostile counterspeech dissuades newcomers from participation in online hate writ large, or merely drives them into less-moderated and more extreme hate communities, and b) raise ethical considerations about hostile counterspeech, which is both comparatively common and might exacerbate rather than mitigate the net level of antagonism in society. These findings underscore the importance of future work to improve counterspeech tactics and minimize unintended harm.
Submitted 28 May, 2024;
originally announced May 2024.
-
Trust and Terror: Hazards in Text Reveal Negatively Biased Credulity and Partisan Negativity Bias
Authors:
Keith Burghardt,
Daniel M. T. Fessler,
Chyna Tang,
Anne Pisor,
Kristina Lerman
Abstract:
Socio-linguistic indicators of text, such as emotion or sentiment, are often extracted using neural networks in order to better understand features of social media. One indicator that is often overlooked, however, is the presence of hazards within text. Recent psychological research suggests that statements about hazards are more believable than statements about benefits (a property known as negatively biased credulity), and that political liberals and conservatives differ in how often they share hazards. Here, we develop a new model to detect information concerning hazards, trained on a new collection of annotated X posts, as well as urban legends annotated in previous work. We show that not only does this model perform well (outperforming, e.g., zero-shot human annotator proxies, such as GPT-4) but that the hazard information it extracts is not strongly correlated with other indicators, namely moral outrage, sentiment, emotions, and threat words. (That said, consonant with expectations, hazard information does correlate positively with such emotions as fear, and negatively with emotions like joy.) We then apply this model to three datasets: X posts about COVID-19, X posts about the 2023 Hamas-Israel war, and a new expanded collection of urban legends. From these data, we uncover words associated with hazards unique to each dataset as well as differences in this language between groups of users, such as conservatives and liberals, which informs what these groups perceive as hazards. We further show that information about hazards peaks in frequency after major hazard events, and therefore acts as an automated indicator of such events. Finally, we find that information about hazards is especially prevalent in urban legends, which is consistent with previous work that finds that reports of hazards are more likely to be both believed and transmitted.
Submitted 28 May, 2024;
originally announced May 2024.
-
The Peripatetic Hater: Predicting Movement Among Hate Subreddits
Authors:
Daniel Hickey,
Daniel M. T. Fessler,
Kristina Lerman,
Keith Burghardt
Abstract:
Many online hate groups exist to disparage others based on race, gender identity, sex, or other characteristics. The accessibility of these communities allows users to join multiple types of hate groups (e.g., a racist community and misogynistic community), which calls into question whether these peripatetic users could be further radicalized compared to users that stay in one type of hate group. However, little is known about the dynamics of joining multiple types of hate groups, nor the effect of these groups on peripatetic users. In this paper, we develop a new method to classify hate subreddits, and the identities they disparage, which we use to better understand how users become peripatetic (join different types of hate subreddits). The hate classification technique utilizes human-validated LLMs to extract the protected identities attacked, if any, across 168 subreddits. We then cluster identity-attacking subreddits to discover three broad categories of hate: racist, anti-LGBTQ, and misogynistic. We show that becoming active in a user's first hate subreddit can cause them to become active in additional hate subreddits of a different category. We also find that users who join additional hate subreddits, especially of a different category, become more active in hate subreddits as a whole and develop a wider hate group lexicon. We are therefore motivated to train an AI model that we find usefully predicts the hate categories users will become active in based on post text read and written. The accuracy of this model may be partly driven by peripatetic users often using the language of hate subreddits they eventually join. Overall, these results highlight the unique risks associated with hate communities on a social media platform, as discussion of alternative targets of hate may lead users to target more protected identities.
Submitted 27 May, 2024;
originally announced May 2024.
-
SoMeR: Multi-View User Representation Learning for Social Media
Authors:
Siyi Guo,
Keith Burghardt,
Valeria Pantè,
Kristina Lerman
Abstract:
User representation learning aims to capture user preferences, interests, and behaviors in low-dimensional vector representations. These representations have widespread applications in recommendation systems and advertising; however, existing methods typically rely on specific features like text content, activity patterns, or platform metadata, failing to holistically model user behavior across different modalities. To address this limitation, we propose SoMeR, a Social Media user Representation learning framework that incorporates temporal activities, text content, profile information, and network interactions to learn comprehensive user portraits. SoMeR encodes user post streams as sequences of timestamped textual features, uses transformers to embed this along with profile data, and jointly trains with link prediction and contrastive learning objectives to capture user similarity. We demonstrate SoMeR's versatility through two applications: 1) Identifying inauthentic accounts involved in coordinated influence operations by detecting users posting similar content simultaneously, and 2) Measuring increased polarization in online discussions after major events by quantifying how users with different beliefs moved farther apart in the embedding space. SoMeR's ability to holistically model users enables new solutions to important problems around disinformation, societal tensions, and online behavior understanding.
Submitted 2 May, 2024;
originally announced May 2024.
-
Large Language Models Reveal Information Operation Goals, Tactics, and Narrative Frames
Authors:
Keith Burghardt,
Kai Chen,
Kristina Lerman
Abstract:
Adversarial information operations can destabilize societies by undermining fair elections, manipulating public opinions on policies, and promoting scams. Despite their widespread occurrence and potential impacts, our understanding of influence campaigns is limited by manual analysis of messages and subjective interpretation of their observable behavior. In this paper, we explore whether these limitations can be mitigated with large language models (LLMs), using GPT-3.5 as a case study for coordinated campaign annotation. We first use GPT-3.5 to scrutinize 126 identified information operations spanning over a decade. We utilize a number of metrics to quantify the close (if imperfect) agreement between LLM and ground truth descriptions. We next extract coordinated campaigns from two large multilingual datasets from X (formerly Twitter) that respectively discuss the 2022 French election and the 2023 Balikatan Philippine-U.S. military exercise. For each coordinated campaign, we use GPT-3.5 to analyze posts related to a specific concern and extract goals, tactics, and narrative frames, both before and after critical events (such as the date of an election). While GPT-3.5 sometimes disagrees with subjective interpretation, its ability to summarize and interpret demonstrates LLMs' potential to extract higher-order indicators from text to provide a more complete picture of the information campaigns compared to previous methods.
Submitted 6 May, 2024;
originally announced May 2024.
-
GET-Tok: A GenAI-Enriched Multimodal TikTok Dataset Documenting the 2022 Attempted Coup in Peru
Authors:
Gabriela Pinto,
Keith Burghardt,
Kristina Lerman,
Emilio Ferrara
Abstract:
TikTok is one of the largest and fastest-growing social media sites in the world. TikTok features, however, such as voice transcripts, are often missing and other important features, such as OCR or video descriptions, do not exist. We introduce the Generative AI Enriched TikTok (GET-Tok) data, a pipeline for collecting TikTok videos and enriched data by augmenting the TikTok Research API with generative AI models. As a case study, we collect videos about the attempted coup in Peru initiated by its former President, Pedro Castillo, and its accompanying protests. The data includes information on 43,697 videos published from November 20, 2022 to March 1, 2023 (102 days). Generative AI augments the collected data via transcripts of TikTok videos, text descriptions of what is shown in the videos, what text is displayed within the video, and the stances expressed in the video. Overall, this pipeline will contribute to a better understanding of online discussion in a multimodal setting with applications of Generative AI, especially outlining the utility of this pipeline in non-English-language social media. Our code used to produce the pipeline is in a public GitHub repository: https://github.com/gabbypinto/GET-Tok-Peru.
Submitted 8 February, 2024;
originally announced February 2024.
-
IsamasRed: A Public Dataset Tracking Reddit Discussions on Israel-Hamas Conflict
Authors:
Kai Chen,
Zihao He,
Keith Burghardt,
Jingxin Zhang,
Kristina Lerman
Abstract:
The conflict between Israel and Palestinians significantly escalated after the October 7, 2023 Hamas attack, capturing global attention. To understand the public discourse on this conflict, we present a meticulously compiled dataset, IsamasRed, comprising nearly 400,000 conversations and over 8 million comments from Reddit, spanning from August 2023 to November 2023. We introduce an innovative keyword extraction framework leveraging a large language model to effectively identify pertinent keywords, ensuring a comprehensive data collection. Our initial analysis on the dataset, examining topics, controversy, emotional and moral language trends over time, highlights the emotionally charged and complex nature of the discourse. This dataset aims to enrich the understanding of online discussions, shedding light on the complex interplay between ideology, sentiment, and community engagement in digital spaces.
Submitted 16 April, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
Unmasking the Web of Deceit: Uncovering Coordinated Activity to Expose Information Operations on Twitter
Authors:
Luca Luceri,
Valeria Pantè,
Keith Burghardt,
Emilio Ferrara
Abstract:
Social media platforms, particularly Twitter, have become pivotal arenas for influence campaigns, often orchestrated by state-sponsored information operations (IOs). This paper delves into the detection of key players driving IOs by employing similarity graphs constructed from behavioral pattern data. We unveil that well-known, yet underutilized network properties can help accurately identify coordinated IO drivers. Drawing from a comprehensive dataset of 49 million tweets from six countries, which includes multiple verified IOs, our study reveals that traditional network filtering techniques do not consistently pinpoint IO drivers across campaigns. We first propose a framework based on node pruning that emerges superior, particularly when combining multiple behavioral indicators across different networks. Then, we introduce a supervised machine learning model that harnesses a vector representation of the fused similarity network. This model, which boasts a precision exceeding 0.95, adeptly classifies IO drivers on a global scale and reliably forecasts their temporal engagements. Our findings are crucial in the fight against deceptive influence campaigns on social media, helping us better understand and detect them.
Submitted 15 October, 2023;
originally announced October 2023.
-
Socio-Linguistic Characteristics of Coordinated Inauthentic Accounts
Authors:
Keith Burghardt,
Ashwin Rao,
Siyi Guo,
Zihao He,
Georgios Chochlakis,
Baruah Sabyasachee,
Andrew Rojecki,
Shri Narayanan,
Kristina Lerman
Abstract:
Online manipulation is a pressing concern for democracies, but the actions and strategies of coordinated inauthentic accounts, which have been used to interfere in elections, are not well understood. We analyze a five million-tweet multilingual dataset related to the 2017 French presidential election, when a major information campaign led by Russia called "#MacronLeaks" took place. We utilize heuristics to identify coordinated inauthentic accounts and detect attitudes, concerns and emotions within their tweets, collectively known as socio-linguistic characteristics. We find that coordinated accounts retweet other coordinated accounts far more than expected by chance, while being exceptionally active just before the second round of voting. Concurrently, socio-linguistic characteristics reveal that coordinated accounts share tweets promoting a candidate at three times the rate of non-coordinated accounts. Coordinated account tactics also varied in time to reflect news events and rounds of voting. Our analysis highlights the utility of socio-linguistic characteristics to inform researchers about tactics of coordinated accounts and how these may feed into online social manipulation.
Submitted 30 May, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
-
Auditing Elon Musk's Impact on Hate Speech and Bots
Authors:
Daniel Hickey,
Matheus Schmitz,
Daniel Fessler,
Paul Smaldino,
Goran Muric,
Keith Burghardt
Abstract:
On October 27th, 2022, Elon Musk purchased Twitter, becoming its new CEO and firing many top executives in the process. Musk listed fewer restrictions on content moderation and removal of spam bots among his goals for the platform. Given findings of prior research on moderation and hate speech in online communities, the promise of less strict content moderation poses the concern that hate will rise on Twitter. We examine the levels of hate speech and prevalence of bots before and after Musk's acquisition of the platform. We find that hate speech rose dramatically upon Musk purchasing Twitter and the prevalence of most types of bots increased, while the prevalence of astroturf bots decreased.
Submitted 28 January, 2024; v1 submitted 8 April, 2023;
originally announced April 2023.
-
Clique Densification in Networks
Authors:
Haochen Pi,
Keith Burghardt,
Allon G. Percus,
Kristina Lerman
Abstract:
Real-world networks are rarely static. Recently, there has been increasing interest in both network growth and network densification, in which the number of edges scales superlinearly with the number of nodes. Less studied but equally important, however, are scaling laws of higher-order cliques, which can drive clustering and network redundancy. In this paper, we study how cliques grow with network size, by analyzing several empirical networks from emails to Wikipedia interactions. Our results show superlinear scaling laws whose exponents increase with clique size, in contrast to predictions from a previous model. We then show that these results are in qualitative agreement with a new model that we propose, the Local Preferential Attachment Model, where an incoming node links not only to a target node but also to its higher-degree neighbors. Our results provide new insights into how networks grow and where network redundancy occurs.
Submitted 7 April, 2023;
originally announced April 2023.
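The densification scaling mentioned in the Clique Densification abstract above, in which a count such as edges or cliques grows as a power of the number of nodes, can be illustrated with a log-log least-squares fit. This is a minimal sketch, not the authors' estimation procedure: the function name `fit_scaling_exponent`, the choice of ordinary least squares, and the synthetic numbers are all assumptions made for illustration.

```python
import math


def fit_scaling_exponent(sizes, counts):
    """Fit counts ~ prefactor * sizes**exponent by least squares in log-log space.

    Hypothetical helper for illustration; returns (exponent, prefactor).
    """
    if len(sizes) != len(counts) or len(sizes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor


# Synthetic data only (not from the paper): edges grow as 2 * N**1.3.
nodes = [100, 200, 400, 800, 1600]
edges = [2.0 * n ** 1.3 for n in nodes]

alpha, prefactor = fit_scaling_exponent(nodes, edges)
print(f"estimated exponent: {alpha:.3f}")  # an exponent above 1 means superlinear growth
```

An exponent greater than 1 corresponds to the superlinear scaling the abstract describes. The same fit could be repeated with triangle or larger-clique counts in place of `edges` to compare exponents across clique sizes.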
-
No Love Among Haters: Negative Interactions Reduce Hate Community Engagement
Authors:
Daniel Hickey,
Matheus Schmitz,
Daniel Fessler,
Paul Smaldino,
Goran Muric,
Keith Burghardt
Abstract:
While online hate groups pose significant risks to the health of online platforms and safety of marginalized groups, little is known about what causes users to become active in hate groups and the effect of social interactions on furthering their engagement. We address this gap by first developing tools to find hate communities within Reddit, and then augment 11 subreddits extracted with 14 known hateful subreddits (25 in total). Using causal inference methods, we evaluate the effect of replies on engagement in hateful subreddits by comparing users who receive replies to their first comment (the treatment) to equivalent control users who do not. We find users who receive replies are less likely to become engaged in hateful subreddits than users who do not, while the opposite effect is observed for a matched sample of similar-sized non-hateful subreddits. Using the Google Perspective API and VADER, we discover that hateful community first-repliers are more toxic, negative, and attack the posters more often than non-hateful first-repliers. In addition, we uncover a negative correlation between engagement and attacks or toxicity of first-repliers. We simulate the cumulative engagement of hateful and non-hateful subreddits under the contra-positive scenario of friendly first-replies, finding that attacks dramatically reduce engagement in hateful subreddits. These results counter-intuitively imply that, although under-moderated communities allow hate to fester, the resulting environment is such that direct social interaction does not encourage further participation, thus endogenously constraining the harmful role that these communities could play as recruitment venues for antisocial beliefs.
Submitted 23 March, 2023;
originally announced March 2023.
-
Data-Driven Estimation of Heterogeneous Treatment Effects
Authors:
Christopher Tran,
Keith Burghardt,
Kristina Lerman,
Elena Zheleva
Abstract:
Estimating how a treatment affects different individuals, known as heterogeneous treatment effect estimation, is an important problem in empirical sciences. In the last few years, there has been considerable interest in adapting machine learning algorithms to the problem of estimating heterogeneous effects from observational and experimental data. However, these algorithms often make strong assumptions about the observed features in the data and ignore the structure of the underlying causal model, which can lead to biased estimation. At the same time, the underlying causal mechanism is rarely known in real-world datasets, making it hard to take it into consideration. In this work, we provide a survey of state-of-the-art data-driven methods for heterogeneous treatment effect estimation using machine learning, broadly categorizing them as methods that focus on counterfactual prediction and methods that directly estimate the causal effect. We also provide an overview of a third category of methods which rely on structural causal models and learn the model structure from data. Our empirical evaluation under various underlying structural model mechanisms shows the advantages and deficiencies of existing estimators and of the metrics for measuring their performance.
Submitted 16 January, 2023;
originally announced January 2023.
-
Decomposing the Fundamentals of Creepy Stories
Authors:
Sakshi Goel,
Haripriya Dharmala,
Yuchen Zhang,
Keith Burghardt
Abstract:
Fear is a universal concept; people crave it in urban legends, scary movies, and modern stories. Open questions remain, however, about why these stories are scary and more generally what scares people. In this study, we explore these questions by analyzing tens of thousands of scary stories on forums (known as subreddits) in a social media website, Reddit. We first explore how writing styles have evolved to keep these stories fresh before we analyze the stable core techniques writers use to make stories scary. We find that writers have changed the themes of their stories over years from haunted houses to school-related themes, body horror, and diseases. Yet some features remain stable; words associated with pseudo-human nouns, such as clown or devil, are more common in scary stories than baselines. In addition, we collect a range of datasets that annotate sentences containing fear. We use these data to develop a high-accuracy fear detection neural network model, which is used to quantify where people express fear in scary stories. We find that sentences describing fear, and words most often seen in scary stories, spike at particular points in a story, possibly as a way to keep the readers on the edge of their seats until the story's conclusion. These results provide a new understanding of how authors cater to their readers, and how fear may manifest in stories.
Submitted 10 November, 2022;
originally announced November 2022.
-
Using Emotion Embeddings to Transfer Knowledge Between Emotions, Languages, and Annotation Formats
Authors:
Georgios Chochlakis,
Gireesh Mahajan,
Sabyasachee Baruah,
Keith Burghardt,
Kristina Lerman,
Shrikanth Narayanan
Abstract:
The need for emotional inference from text continues to diversify as more and more disciplines integrate emotions into their theories and applications. These needs include inferring different emotion types, handling multiple languages, and different annotation formats. A shared model between different configurations would enable the sharing of knowledge and a decrease in training costs, and would simplify the process of deploying emotion recognition models in novel environments. In this work, we study how we can build a single model that can transition between these different configurations by leveraging multilingual models and Demux, a transformer-based model whose input includes the emotions of interest, enabling us to dynamically change the emotions predicted by the model. Demux also produces emotion embeddings, and performing operations on them allows us to transition to clusters of emotions by pooling the embeddings of each cluster. We show that Demux can simultaneously transfer knowledge in a zero-shot manner to a new language, to a novel annotation format and to unseen emotions. Code is available at https://github.com/gchochla/Demux-MEmo .
Submitted 11 March, 2023; v1 submitted 31 October, 2022;
originally announced November 2022.
-
Leveraging Label Correlations in a Multi-label Setting: A Case Study in Emotion
Authors:
Georgios Chochlakis,
Gireesh Mahajan,
Sabyasachee Baruah,
Keith Burghardt,
Kristina Lerman,
Shrikanth Narayanan
Abstract:
Detecting emotions expressed in text has become critical to a range of fields. In this work, we investigate ways to exploit label correlations in multi-label emotion recognition models to improve emotion detection. First, we develop two modeling approaches to the problem in order to capture word associations of the emotion words themselves, by either including the emotions in the input, or by leveraging Masked Language Modeling (MLM). Second, we integrate pairwise constraints of emotion representations as regularization terms alongside the classification loss of the models. We split these terms into two categories, local and global. The former dynamically change based on the gold labels, while the latter remain static during training. We demonstrate state-of-the-art performance across Spanish, English, and Arabic in SemEval 2018 Task 1 E-c using monolingual BERT-based models. On top of better performance, we also demonstrate improved robustness. Code is available at https://github.com/gchochla/Demux-MEmo.
Submitted 11 March, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Analyzing urban scaling laws in the United States over 115 years
Authors:
Keith Burghardt,
Johannes H. Uhl,
Kristina Lerman,
Stefan Leyk
Abstract:
The scaling relations between city attributes and population are emergent and ubiquitous aspects of urban growth. Quantifying these relations and understanding their theoretical foundation, however, is difficult due to the challenge of defining city boundaries and a lack of historical data to study city dynamics over time and space. To address this issue, we analyze scaling between city infrastructure and population across 857 United States metropolitan areas over an unprecedented 115 years using dasymetrically refined historical population estimates, historical urban road network models, and multi-temporal settlement data to define dynamic city boundaries based on settlement density. We demonstrate the clearest evidence that urban scaling exponents can closely match theoretical models over a century if cities are defined as dense settlement patches. Despite the close quantitative agreement with theory, the empirical scaling relations unexpectedly vary across regions. Our analysis of scaling coefficients, meanwhile, reveals that a city in 2015 uses more developed land and kilometers of road than a city with a similar population in 1900, which has serious implications for urban development and impacts on the local environment. Overall, our results offer a new way to study urban systems based on novel, geohistorical data.
Submitted 29 January, 2023; v1 submitted 22 September, 2022;
originally announced September 2022.
-
Quantifying How Hateful Communities Radicalize Online Users
Authors:
Matheus Schmitz,
Keith Burghardt,
Goran Muric
Abstract:
While online social media offers a way for ignored or stifled voices to be heard, it also allows users a platform to spread hateful speech. Such speech usually originates in fringe communities, yet it can spill over into mainstream channels. In this paper, we measure the impact of joining fringe hateful communities in terms of hate speech propagated to the rest of the social network. We leverage data from Reddit to assess the effect of joining one type of echo chamber: a digital community of like-minded users exhibiting hateful behavior. We measure members' usage of hate speech outside the studied community before and after they become active participants. Using Interrupted Time Series (ITS) analysis as a causal inference method, we gauge the spillover effect, in which hateful language from within a certain community can spread outside that community by using the level of out-of-community hate word usage as a proxy for learned hate. We investigate four different Reddit sub-communities (subreddits) covering three areas of hate speech: racism, misogyny and fat-shaming. In all three cases we find an increase in hate speech outside the originating community, implying that joining such community leads to a spread of hate speech throughout the platform. Moreover, users are found to pick up this new hateful speech for months after initially joining the community. We show that the harmful speech does not remain contained within the community. Our results provide new evidence of the harmful effects of echo chambers and the potential benefit of moderating them to reduce adoption of hateful speech.
Submitted 2 October, 2022; v1 submitted 18 September, 2022;
originally announced September 2022.
-
Inferring topological transitions in pattern-forming processes with self-supervised learning
Authors:
Marcin Abram,
Keith Burghardt,
Greg Ver Steeg,
Aram Galstyan,
Remi Dingreville
Abstract:
The identification and classification of transitions in topological and microstructural regimes in pattern-forming processes are critical for understanding and fabricating microstructurally precise novel materials in many application domains. Unfortunately, relevant microstructure transitions may depend on process parameters in subtle and complex ways that are not captured by the classic theory of phase transition. While supervised machine learning methods may be useful for identifying transition regimes, they need labels which require prior knowledge of order parameters or relevant structures describing these transitions. Motivated by the universality principle for dynamical systems, we instead use a self-supervised approach to solve the inverse problem of predicting process parameters from observed microstructures using neural networks. This approach does not require predefined, labeled data about the different classes of microstructural patterns or about the target task of predicting microstructure transitions. We show that the difficulty of performing the inverse-problem prediction task is related to the goal of discovering microstructure regimes, because qualitative changes in microstructural patterns correspond to changes in uncertainty predictions for our self-supervised problem. We demonstrate the value of our approach by automatically discovering transitions in microstructural regimes in two distinct pattern-forming processes: the spinodal decomposition of a two-phase mixture and the formation of concentration modulations of binary alloys during physical vapor deposition of thin films. This approach opens a promising path forward for discovering and understanding unseen or hard-to-discern transition regimes, and ultimately for controlling complex pattern-forming processes.
Submitted 10 August, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Emotion Regulation and Dynamics of Moral Concerns During the Early COVID-19 Pandemic
Authors:
Siyi Guo,
Keith Burghardt,
Ashwin Rao,
Kristina Lerman
Abstract:
The COVID-19 pandemic has upended daily life around the globe, posing a threat to public health. Intuitively, we expect that surging cases and deaths would lead to fear, distress and other negative emotions. However, using state-of-the-art methods to measure sentiment, emotions, and moral concerns in social media messages posted in the early stage of the pandemic, we see a counter-intuitive rise in positive affect. We hypothesize that the increase of positivity is associated with a decrease of uncertainty and emotion regulation. Finally, we identify a partisan divide in moral and emotional reactions that emerged after the first US death. Overall, these results show how collective emotional states have changed since the pandemic began, and how social media can provide a useful tool to understand, and even regulate, diverse patterns underlying human affect.
Submitted 11 April, 2022; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Emergent Instabilities in Algorithmic Feedback Loops
Authors:
Keith Burghardt,
Kristina Lerman
Abstract:
Algorithms that aid human tasks, such as recommendation systems, are ubiquitous. They appear in everything from social media to streaming videos to online shopping. However, the feedback loop between people and algorithms is poorly understood and can amplify cognitive and social biases (algorithmic confounding), leading to unexpected outcomes. In this work, we explore algorithmic confounding in collaborative filtering-based recommendation algorithms through teacher-student learning simulations. Namely, a student collaborative filtering-based model, trained on simulated choices, is used by the recommendation algorithm to recommend items to agents. Agents might choose some of these items, according to an underlying teacher model, with new choices then fed back into the student model as new training data (approximating online machine learning). These simulations demonstrate how algorithmic confounding produces erroneous recommendations which in turn lead to instability, i.e., wide variations in an item's popularity between each simulation realization. We use the simulations to demonstrate a novel approach to training collaborative filtering models that can create more stable and accurate recommendations. Our methodology is general enough that it can be extended to other socio-technical systems in order to better quantify and improve the stability of algorithms. These results highlight the need to account for emergent behaviors from interactions between people and algorithms.
Submitted 18 January, 2022;
originally announced January 2022.
-
Heterogeneous Effects of Software Patches in a Multiplayer Online Battle Arena Game
Authors:
Yuzi He,
Christopher Tran,
Julie Jiang,
Keith Burghardt,
Emilio Ferrara,
Elena Zheleva,
Kristina Lerman
Abstract:
The popularity of online gaming has grown dramatically, driven in part by streaming and the billion-dollar e-sports industry. Online games regularly update their software to fix bugs, add functionality that improves the game's look and feel, and change the game mechanics to keep the games fun and challenging. An open question, however, is the impact of these changes on player performance and game balance, as well as how players adapt to these sudden changes. To address these questions, we use causal inference to measure the impact of software patches to League of Legends, a popular team-based multiplayer online game. We show that game patches have substantially different impacts on players depending on their skill level and whether they take breaks between games. We find that the gap between good and bad players increases after a patch, despite efforts to make gameplay more equal. Moreover, longer between-game breaks tend to improve player performance after patches. Overall, our results highlight the utility of causal inference, and specifically heterogeneous treatment effect estimation, as a tool to quantify the complex mechanisms of game balance and its interplay with players' performance.
Submitted 27 October, 2021;
originally announced October 2021.
-
Detecting Anti-Vaccine Users on Twitter
Authors:
Matheus Schmitz,
Goran Murić,
Keith Burghardt
Abstract:
Vaccine hesitancy, which has recently been driven by online narratives, significantly degrades the efficacy of vaccination strategies, such as those for COVID-19. Despite broad agreement in the medical community about the safety and efficacy of available vaccines, a large number of social media users continue to be inundated with false information about vaccines and are indecisive or unwilling to be vaccinated. The goal of this study is to better understand anti-vaccine sentiment by developing a system capable of automatically identifying the users responsible for spreading anti-vaccine narratives. We introduce a publicly available Python package capable of analyzing Twitter profiles to assess how likely that profile is to share anti-vaccine sentiment in the future. The software package is built using text embedding methods, neural networks, and automated dataset generation and is trained on several million tweets. We find this model can accurately detect anti-vaccine users up to a year before they tweet anti-vaccine hashtags or keywords. We also show examples of how text analysis helps us understand anti-vaccine discussions by detecting moral and emotional differences between anti-vaccine spreaders on Twitter and regular users. Our results will help researchers and policy-makers understand how users become anti-vaccine and what they discuss on Twitter. Policy-makers can utilize this information for better targeted campaigns that debunk harmful anti-vaccination myths.
Submitted 3 November, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
-
DoGR: Disaggregated Gaussian Regression for Reproducible Analysis of Heterogeneous Data
Authors:
Nazanin Alipourfard,
Keith Burghardt,
Kristina Lerman
Abstract:
Quantitative analysis of large-scale data is often complicated by the presence of diverse subgroups, which reduce the accuracy of inferences they make on held-out data. To address the challenge of heterogeneous data analysis, we introduce DoGR, a method that discovers latent confounders by simultaneously partitioning the data into overlapping clusters (disaggregation) and modeling the behavior within them (regression). When applied to real-world data, our method discovers meaningful clusters and their characteristic behaviors, thus giving insight into group differences and their impact on the outcome of interest. By accounting for latent confounders, our framework facilitates exploratory analysis of noisy, heterogeneous data and can be used to learn predictive models that better generalize to new data. We provide the code to enable others to use DoGR within their data analytic workflows.
Submitted 30 August, 2021;
originally announced August 2021.
-
Road Network Evolution in the Urban and Rural United States Since 1900
Authors:
Keith Burghardt,
Johannes Uhl,
Kristina Lerman,
Stefan Leyk
Abstract:
Road networks represent a key component of human settlements, such as cities, towns, and villages, that mediate pollution and congestion, as well as economic development. However, little is known about the long-term development trajectories of road networks in rural and urban settings. We leverage novel spatial data sources to reconstruct and analyze road networks in more than 850 US cities and over 2,500 US counties since 1900. Our analysis reveals significant variations in the structure of roads both within cities and across the conterminous US. Despite differences in the evolution of these networks, there are commonalities and strong geographic patterns. These results persist across the rural-urban continuum and are therefore not just a product of accelerated urban growth. These findings refine and extend existing knowledge and illuminate the need for policies for urban and rural planning including the critical assessment of new development trends.
Submitted 16 March, 2022; v1 submitted 30 August, 2021;
originally announced August 2021.
-
Limiting Tags Fosters Efficiency
Authors:
Tiago Santos,
Keith Burghardt,
Kristina Lerman,
Denis Helic
Abstract:
Tagging facilitates information retrieval in social media and other online communities by allowing users to organize and describe online content. Researchers found that the efficiency of tagging systems steadily decreases over time, because tags become less precise in identifying specific documents, i.e., they lose their descriptiveness. However, previous works did not answer how or even whether community managers can improve the efficiency of tags. In this work, we use information-theoretic measures to track the descriptive and retrieval efficiency of tags on Stack Overflow, a question-answering system that strictly limits the number of tags users can specify per question. We observe that tagging efficiency stabilizes over time, while tag content and descriptiveness both increase. To explain this observation, we hypothesize that limiting the number of tags fosters novelty and diversity in tag usage, two properties which are both beneficial for tagging efficiency. To provide qualitative evidence supporting our hypothesis, we present a statistical model of tagging that demonstrates how novelty and diversity lead to greater tag efficiency in the long run. Our work offers insights into policies to improve information organization and retrieval in online communities.
Submitted 2 April, 2021;
originally announced April 2021.
-
A Model of Densifying Collaboration Networks
Authors:
Keith A. Burghardt,
Allon G. Percus,
Kristina Lerman
Abstract:
Research collaborations provide the foundation for scientific advances, but we have only recently begun to understand how they form and grow on a global scale. Here we analyze a model of the growth of research collaboration networks to explain the empirical observations that the number of collaborations scales superlinearly with institution size, though at different rates (heterogeneous densification), the number of institutions grows as a power of the number of researchers (Heaps' law) and institution sizes approximate Zipf's law. This model has three mechanisms: (i) researchers are preferentially hired by large institutions, (ii) new institutions trigger more potential institutions, and (iii) researchers collaborate with friends-of-friends. We show agreement between these assumptions and empirical data, through analysis of co-authorship networks spanning two centuries. We then develop a theoretical understanding of this model, which reveals emergent heterogeneous scaling such that the number of collaborations between institutions scale with an institution's size.
Submitted 26 January, 2021;
originally announced January 2021.
-
Political Partisanship and Anti-Science Attitudes in Online Discussions about Covid-19
Authors:
Ashwin Rao,
Fred Morstatter,
Minda Hu,
Emily Chen,
Keith Burghardt,
Emilio Ferrara,
Kristina Lerman
Abstract:
The novel coronavirus pandemic continues to ravage communities across the US. Opinion surveys identified importance of political ideology in shaping perceptions of the pandemic and compliance with preventive measures. Here, we use social media data to study complexity of polarization. We analyze a large dataset of tweets related to the pandemic collected between January and May of 2020, and develop methods to classify the ideological alignment of users along the moderacy (hardline vs moderate), political (liberal vs conservative) and science (anti-science vs pro-science) dimensions. While polarization along the science and political dimensions are correlated, politically moderate users are more likely to be aligned with the pro-science views, and politically hardline users with anti-science views. Contrary to expectations, we do not find that polarization grows over time; instead, we see increasing activity by moderate pro-science users. We also show that anti-science conservatives tend to tweet from the Southern US, while anti-science moderates from the Western states. Our findings shed light on the multi-dimensional nature of polarization, and the feasibility of tracking polarized opinions about the pandemic across time and space through social media data.
Submitted 17 November, 2020;
originally announced November 2020.
-
Inherent Trade-offs in the Fair Allocation of Treatments
Authors:
Yuzi He,
Keith Burghardt,
Siyi Guo,
Kristina Lerman
Abstract:
Explicit and implicit bias clouds human judgement, leading to discriminatory treatment of minority groups. A fundamental goal of algorithmic fairness is to avoid the pitfalls in human judgement by learning policies that improve the overall outcomes while providing fair treatment to protected classes. In this paper, we propose a causal framework that learns optimal intervention policies from data subject to fairness constraints. We define two measures of treatment bias and infer best treatment assignment that minimizes the bias while optimizing overall outcome. We demonstrate that there is a dilemma of balancing fairness and overall benefit; however, allowing preferential treatment to protected classes in certain circumstances (affirmative action) can dramatically improve the overall benefit while also preserving fairness. We apply our framework to data containing student outcomes on standardized tests and show how it can be used to design real-world policies that fairly improve student test scores. Our framework provides a principled way to learn fair treatment policies in real-world settings.
△ Less
Submitted 30 October, 2020;
originally announced October 2020.
-
Origins of Algorithmic Instabilities in Crowdsourced Ranking
Authors:
Keith Burghardt,
Tad Hogg,
Raissa M. D'Souza,
Kristina Lerman,
Marton Posfai
Abstract:
Crowdsourcing systems aggregate decisions of many people to help users quickly identify high-quality options, such as the best answers to questions or interesting news stories. A long-standing issue in crowdsourcing is how option quality and human judgement heuristics interact to affect collective outcomes, such as the perceived popularity of options. We address this limitation by conducting a controlled experiment where subjects choose between two ranked options whose quality can be independently varied. We use this data to construct a model that quantifies how judgement heuristics and option quality combine when deciding between two options. The model reveals popularity-ranking can be unstable: unless the quality difference between the two options is sufficiently high, the higher quality option is not guaranteed to be eventually ranked on top. To rectify this instability, we create an algorithm that accounts for judgement heuristics to infer the best option and rank it first. This algorithm is guaranteed to be optimal if data matches the model. When the data does not match the model, however, simulations show that in practice this algorithm performs better or at least as well as popularity-based and recency-based ranking for any two-choice question. Our work suggests that algorithms relying on inference of mathematical models of user behavior can substantially improve outcomes in crowdsourcing systems.
Submitted 23 October, 2020;
originally announced October 2020.
-
Having a Bad Day? Detecting the Impact of Atypical Life Events Using Wearable Sensors
Authors:
Keith Burghardt,
Nazgol Tavabi,
Emilio Ferrara,
Shrikanth Narayanan,
Kristina Lerman
Abstract:
Life events can dramatically affect our psychological state and work performance. Stress, for example, has been linked to professional dissatisfaction, increased anxiety, and workplace burnout. We explore the impact of positive and negative life events on a number of psychological constructs through a multi-month longitudinal study of hospital and aerospace workers. Through causal inference, we demonstrate that positive life events increase positive affect, while negative events increase stress, anxiety and negative affect. While most events have a transient effect on psychological states, major negative events, like illness or attending a funeral, can reduce positive affect for multiple days. Next, we assess whether these events can be detected through wearable sensors, which can cheaply and unobtrusively monitor health-related factors. We show that these sensors paired with embedding-based learning models can be used "in the wild" to capture atypical life events in hundreds of workers across both datasets. Overall our results suggest that automated interventions based on physiological sensing may be feasible to help workers regulate the negative effects of life events.
Submitted 4 August, 2020;
originally announced August 2020.
-
Unequal Impact and Spatial Aggregation Distort COVID-19 Growth Rates
Authors:
Keith Burghardt,
Kristina Lerman
Abstract:
The COVID-19 pandemic has emerged as a global public health crisis. To make decisions about mitigation strategies and to understand the disease dynamics, policy makers and epidemiologists must know how the disease is spreading in their communities. We analyze confirmed infections and deaths over multiple geographic scales to show that COVID-19's impact is highly unequal: many subregions have nearly zero infections, and others are hot spots. We attribute the effect to a Reed-Hughes-like mechanism in which disease arrives at different times and grows exponentially. Hot spots, however, appear to grow faster than neighboring subregions and dominate spatially aggregated statistics, thereby amplifying growth rates. The staggered spread of COVID-19 can also make aggregated growth rates appear higher even when subregions grow at the same rate. Public policy, economic analysis and epidemic modeling need to account for potential distortions introduced by spatial aggregation.
Submitted 15 May, 2020; v1 submitted 27 April, 2020;
originally announced April 2020.
-
The Emergence of Heterogeneous Scaling in Research Institutions
Authors:
Keith A. Burghardt,
Zihao He,
Allon G. Percus,
Kristina Lerman
Abstract:
Research institutions provide the infrastructure for scientific discovery, yet their role in the production of knowledge is not well characterized. To address this gap, we analyze interactions of researchers within and between institutions from millions of scientific papers. Our analysis reveals that the number of collaborations scales superlinearly with institution size, though at different rates (heterogeneous densification). We also find that the number of institutions scales with the number of researchers as a power law (Heaps' law) and institution sizes approximate Zipf's law. These patterns can be reproduced by a simple model with three mechanisms: (i) researchers collaborate with friends-of-friends, (ii) new institutions trigger more potential institutions, and (iii) researchers are preferentially hired by large institutions. This model reveals an economy of scale in research: larger institutions grow faster and amplify collaborations. Our work provides a new understanding of emergent behavior in research institutions and how they facilitate innovation.
Submitted 26 January, 2021; v1 submitted 23 January, 2020;
originally announced January 2020.
-
Learning Fair and Interpretable Representations via Linear Orthogonalization
Authors:
Yuzi He,
Keith Burghardt,
Kristina Lerman
Abstract:
To reduce human error and prejudice, many high-stakes decisions have been turned over to machine algorithms. However, recent research suggests that this does not remove discrimination, and can perpetuate harmful stereotypes. While algorithms have been developed to improve fairness, they typically face at least one of three shortcomings: they are not interpretable, their prediction quality deteriorates quickly compared to unbiased equivalents, and they are not easily transferable across models. To address these shortcomings, we propose a geometric method that removes correlations between data and any number of protected variables. Further, we can control the strength of debiasing through an adjustable parameter to address the trade-off between prediction quality and fairness. The resulting features are interpretable and can be used with many popular models, such as linear regression, random forest, and multilayer perceptrons. The resulting predictions are found to be more accurate and fair compared to several state-of-the-art fair AI algorithms across a variety of benchmark datasets. Our work shows that debiasing data is a simple and effective solution toward improving fairness.
Submitted 17 December, 2019; v1 submitted 28 October, 2019;
originally announced October 2019.
-
The Transsortative Structure of Networks
Authors:
Xin-Zeng Wu,
Allon G. Percus,
Keith Burghardt,
Kristina Lerman
Abstract:
Network topologies can be non-trivial, due to the complex underlying behaviors that form them. While past research has shown that some processes on networks may be characterized by low-order statistics describing nodes and their neighbors, such as degree assortativity, these quantities fail to capture important sources of variation in network structure. We introduce a property called transsortativity that describes correlations among a node's neighbors, generalizing these statistics from immediate one-hop neighbors to two-hop neighbors. We describe how transsortativity can be systematically varied, independently of the network's degree distribution and assortativity. Moreover, we show that it can significantly impact the spread of contagions as well as the perceptions of neighbors, known as the majority illusion. Our work improves our ability to create and analyze more realistic models of complex networks.
Submitted 21 October, 2019;
originally announced October 2019.
-
Quantifying the Impact of Cognitive Biases in Question-Answering Systems
Authors:
Keith Burghardt,
Tad Hogg,
Kristina Lerman
Abstract:
Crowdsourcing can identify high-quality solutions to problems; however, individual decisions are constrained by cognitive biases. We investigate some of these biases in an experimental model of a question-answering system. In both natural and controlled experiments, we observe a strong position bias in favor of answers appearing earlier in a list of choices. This effect is enhanced by three cognitive factors: the attention an answer receives, its perceived popularity, and cognitive load, measured by the number of choices a user has to process. While separately weak, these effects synergistically amplify position bias and decouple user choices of best answers from their intrinsic quality. We end our paper by discussing the novel ways we can apply these findings to substantially improve how high-quality answers are found in question-answering systems.
Submitted 20 September, 2019;
originally announced September 2019.
-
Cascading failures in scale-free interdependent networks
Authors:
Malgorzata Turalska,
Keith Burghardt,
Martin Rohden,
Ananthram Swami,
Raissa M. D'Souza
Abstract:
Large cascades are a common occurrence in many natural and engineered complex systems. In this paper we explore the propagation of cascades across networks using realistic network topologies, such as heterogeneous degree distributions, as well as intra- and interlayer degree correlations. We find that three properties, scale-free degree distribution, internal network assortativity, and cross-network hub-to-hub connections, are all necessary components to significantly reduce the size of large cascades in the Bak-Tang-Wiesenfeld sandpile model. We demonstrate that correlations present in the structure of the multilayer network influence the dynamical cascading process and can prevent failures from spreading across connected layers. These findings highlight the importance of internal and cross-network topology in optimizing stability and robustness of interconnected systems.
Submitted 19 February, 2019;
originally announced February 2019.
-
Evidence of Herding and Stubbornness in Jury Deliberations
Authors:
Keith Burghardt,
William Rand,
Michelle Girvan
Abstract:
We explore how the mechanics of collective decision-making, especially of jury deliberation, can be inferred from macroscopic statistics. We first hypothesize that the dynamics of competing opinions can leave a "fingerprint" in the joint distribution of final votes and time to reach a decision. We probe this hypothesis by modeling jury datasets from different states collected in different years and identifying which of the models best explains opinion dynamics in juries. In our best-fit model, individual jurors have a "herding" tendency to adopt the majority opinion of the jury, but as the amount of time they have held their current opinion increases, so too does their resistance to changing their opinion (what we call "increasing stubbornness"). By contrast, other models without increasing stubbornness, or without herding, create poorer fits to data. Our findings suggest that both stubbornness and herding play an important role in collective decision-making.
Submitted 1 November, 2017; v1 submitted 30 October, 2017;
originally announced October 2017.
-
Dynamics of Content Quality in Collaborative Knowledge Production
Authors:
Emilio Ferrara,
Nazanin Alipourfard,
Keith Burghardt,
Chiranth Gopal,
Kristina Lerman
Abstract:
We explore the dynamics of user performance in collaborative knowledge production by studying the quality of answers to questions posted on Stack Exchange. We propose four indicators of answer quality: answer length, the number of code lines and hyperlinks to external web content it contains, and whether it is accepted by the asker as the most helpful answer to the question. Analyzing millions of answers posted over the period from 2008 to 2014, we uncover regular short-term and long-term changes in quality. In the short term, quality deteriorates over the course of a single session, with each successive answer becoming shorter, with fewer code lines and links, and less likely to be accepted. In contrast, performance improves over the long term, with more experienced users producing higher-quality answers. These trends are not a consequence of data heterogeneity, but rather have a behavioral origin. Our findings highlight the complex interplay between short-term deterioration in performance, potentially due to mental fatigue or attention depletion, and long-term performance improvement due to learning and skill acquisition, and its impact on the quality of user-generated content.
Submitted 10 June, 2017;
originally announced June 2017.
-
Self-Organization of Dragon Kings
Authors:
Yuansheng Lin,
Keith Burghardt,
Martin Rohden,
Pierre-André Noël,
Raissa M. D'Souza
Abstract:
The mechanisms underlying cascading failures are often modeled via the paradigm of self-organized criticality. Here we introduce a simple model where nodes self-organize to be either weak or strong to failure which captures the trade-off between degradation and reinforcement of nodes inherent in many network systems. If strong nodes cannot fail, this leads to power law distributions of failure sizes with so-called "Black Swan" rare events. In contrast, if strong nodes fail once a sufficient fraction of their neighbors fail, this leads to "Dragon Kings", which are massive failures caused by mechanisms distinct from smaller failures. In our model, we find that once an initial failure size is above a critical value, the Dragon King mechanism kicks in, leading to piggybacking system-wide failures. We demonstrate that the size of the initial failed weak cluster predicts the likelihood of a Dragon King event with high accuracy and we develop a simple control strategy which also reveals that a random upgrade can inadvertently make the system more vulnerable. The Dragon Kings observed are self-organized, existing throughout the parameter regime.
Submitted 19 December, 2017; v1 submitted 29 May, 2017;
originally announced May 2017.
-
On Quitting: Performance and Practice in Online Game Play
Authors:
Tushar Agarwal,
Keith A. Burghardt,
Kristina Lerman
Abstract:
We study the relationship between performance and practice by analyzing the activity of many players of a casual online game. We find significant heterogeneity in the improvement of player performance, given by score, and address this by dividing players into similar skill levels and segmenting each player's activity into sessions, i.e., a sequence of game rounds without an extended break. After disaggregating data, we find that performance improves with practice across all skill levels. More interestingly, players are more likely to end their session after an especially large improvement, leading to a peak score in their very last game of a session. In addition, success is strongly correlated with a lower quitting rate when the score drops, and only weakly correlated with skill, in line with psychological findings about the value of persistence and "grit": successful players are those who persist in their practice despite lower scores. Finally, we train an epsilon-machine, a type of hidden Markov model, and find a plausible mechanism of game play that can predict player performance and quitting the game. Our work raises the possibility of real-time assessment and behavior prediction that can be used to optimize human performance.
Submitted 14 March, 2017;
originally announced March 2017.
-
Testing Modeling Assumptions in the West Africa Ebola Outbreak
Authors:
Keith Burghardt,
Christopher Verzijl,
Junming Huang,
Matthew Ingram,
Binyang Song,
Marie-Pierre Hasne
Abstract:
The Ebola virus in West Africa has infected almost 30,000 and killed over 11,000 people. Recent models of Ebola Virus Disease (EVD) have often made assumptions about how the disease spreads, such as uniform transmissibility and homogeneous mixing within a population. In this paper, we test whether these assumptions are necessarily correct, and offer simple solutions that may improve disease model accuracy. First, we use data and models of West African migration to show that EVD does not homogeneously mix, but spreads in a predictable manner. Next, we estimate the initial growth rate of EVD within country administrative divisions and find that it significantly decreases with population density. Finally, we test whether EVD strains have uniform transmissibility through a novel statistical test, and find that certain strains appear more often than expected by chance.
Submitted 12 October, 2016; v1 submitted 23 June, 2016;
originally announced June 2016.
-
The Myopia of Crowds: A Study of Collective Evaluation on Stack Exchange
Authors:
Keith Burghardt,
Emanuel F. Alsina,
Michelle Girvan,
William Rand,
Kristina Lerman
Abstract:
Crowds can often make better decisions than individuals or small groups of experts by leveraging their ability to aggregate diverse information. Question answering sites, such as Stack Exchange, rely on the "wisdom of crowds" effect to identify the best answers to questions asked by users. We analyze data from 250 communities on the Stack Exchange network to pinpoint factors affecting which answers are chosen as the best answers. Our results suggest that, rather than evaluate all available answers to a question, users rely on simple cognitive heuristics to choose an answer to vote for or accept. These cognitive heuristics are linked to an answer's salience, such as the order in which it is listed and how much screen space it occupies. While askers appear to depend more on heuristics, compared to voting users, when choosing an answer to accept as the most helpful one, voters use acceptance itself as a heuristic: they are more likely to choose the answer after it is accepted than before that very same answer was accepted. These heuristics become more important in explaining and predicting behavior as the number of available answers increases. Our findings suggest that crowd judgments may become less reliable as the number of answers grows.
Submitted 23 February, 2016;
originally announced February 2016.
-
Competing opinions and stubbornness: connecting models to data
Authors:
Keith A. Burghardt,
William Rand,
Michelle Girvan
Abstract:
We introduce a general contagion-like model for competing opinions that includes dynamic resistance to alternative opinions. We show that this model can describe candidate vote distributions, spatial vote correlations, and a slow approach to opinion consensus with sensible parameter values. These empirical properties of large group dynamics, previously understood using distinct models, may be different aspects of human behavior that can be captured by a more unified model, such as the one introduced in this paper.
Submitted 7 March, 2016; v1 submitted 26 November, 2014;
originally announced November 2014.
-
The Spectrum Of Hypercubes Quotiented By Doubly Even Codewords And The Thermodynamics Of Adinkras
Authors:
Keith Burghardt
Abstract:
In a previous paper, a solution to the problem of determining isomorphism classes of Lie algebra representations was explored using graphs called adinkras and subgraphs called baobabs (arXiv:1306.0550 [math-ph]). In this paper, I show that adinkras contain Shannon entropy and a latent heat from the information stored in their associated baobabs. In Garden algebra, both properties are closely related to the spectrum of hypercubes quotiented by doubly even codewords, which is introduced in this paper.
Submitted 16 September, 2013;
originally announced September 2013.
-
Creating Infinitesimal Generators And Robust Messages With Adinkras
Authors:
Keith Burghardt
Abstract:
Adinkras are graphs that can describe off-shell supermultiplets in 1 dimension with a Lie superalgebra known as Garden algebra. In this paper, I show that the degrees of freedom of the adinkra can be represented by a subgraph called a baobab. Because the structure of adinkras and baobabs is very general, I will show that all finite-dimensional Lie superalgebras can be similarly described by more general Lie adinkras and Lie baobabs. Furthermore, it will be shown that adinkras can represent forward error correction block codes, and bit erasures in Garden algebra adinkras can be corrected using logic circuits derived from baobabs.
Submitted 15 September, 2013; v1 submitted 3 June, 2013;
originally announced June 2013.
-
Adinkra Isomorphisms and `Seeing' Shapes with Eigenvalues
Authors:
Keith Burghardt,
S. James Gates Jr
Abstract:
We create an algorithm to determine whether any two graphical representations (adinkras) of equations possessing the property of supersymmetry in one or two dimensions are isomorphic in shape. The algorithm is based on the determinant of `permutation matrices' that are defined in this work and derivable for any adinkra.
Submitted 12 December, 2012;
originally announced December 2012.
-
A Computer Algorithm For Engineering Off-Shell Multiplets With Four Supercharges On The World Sheet
Authors:
K. Burghardt,
S. J. Gates Jr
Abstract:
We present an adinkra-based computer algorithm implemented in a Mathematica code and use it in a limited demonstration of how to engineer off-shell, arbitrary N-extended world-sheet supermultiplets. Using one of the outputs from this algorithm, we present evidence for the unexpected discovery of a previously unknown 8 - 8 representation of N = 2 world sheet supersymmetry. As well, we uncover a menagerie of (p, q) = (3, 1) world sheet supermultiplets.
Submitted 17 October, 2012; v1 submitted 22 September, 2012;
originally announced September 2012.
-
SUSY Equation Topology, Zonohedra, and the Search for Alternate Off-Shell Adinkras
Authors:
Keith Burghardt,
S. James Gates Jr
Abstract:
Results are given from a search to form adinkra-like equations based on topologies that are not hypercubes. An alternate class of zonohedra topologies is used to construct adinkra-like graphs. In particular, the rhombic dodecahedron and rhombic icosahedron are studied in detail. Using these topological skeletons, equations similar to those of a supersymmetric system are found. But these fail to have the interpretation of an off-shell supersymmetric system of equations.
Submitted 17 October, 2012; v1 submitted 31 December, 2011;
originally announced January 2012.