-
County-level Algorithmic Audit of Racial Bias in Twitter's Home Timeline
Authors:
Luca Belli,
Kyra Yee,
Uthaipon Tantipongpipat,
Aaron Gonzales,
Kristian Lum,
Moritz Hardt
Abstract:
We report on the outcome of an audit of Twitter's Home Timeline ranking system. The goal of the audit was to determine if authors from some racial groups experience systematically higher impression counts for their Tweets than others. A central obstacle for any such audit is that Twitter does not ordinarily collect or associate racial information with its users, thus prohibiting an analysis at the level of individual authors. Working around this obstacle, we take US counties as our unit of analysis. We associate each user in the United States on the Twitter platform to a county based on available location data. The US Census Bureau provides information about the racial decomposition of the population in each county. The question we investigate then is if the racial decomposition of a county is associated with the visibility of Tweets originating from within the county. Focusing on two racial groups, the Black or African American population and the White population as defined by the US Census Bureau, we evaluate two statistical measures of bias. Our investigation represents the first large-scale algorithmic audit into racial bias on the Twitter platform. Additionally, it illustrates the challenges of measuring racial bias in online platforms without having such information on the users.
Submitted 10 February, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Authors:
Kyra Yee,
Alice Schoenauer Sebag,
Olivia Redfield,
Emily Sheng,
Matthias Eck,
Luca Belli
Abstract:
Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation and bias detection for NLP models are often very ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. In order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.
Submitted 7 October, 2022;
originally announced October 2022.
-
Random Isn't Always Fair: Candidate Set Imbalance and Exposure Inequality in Recommender Systems
Authors:
Amanda Bower,
Kristian Lum,
Tomo Lazovich,
Kyra Yee,
Luca Belli
Abstract:
Traditionally, recommender systems operate by returning a user a set of items, ranked in order of estimated relevance to that user. In recent years, methods relying on stochastic ordering have been developed to create "fairer" rankings that reduce inequality in who or what is shown to users. Complete randomization -- ordering candidate items randomly, independent of estimated relevance -- is largely considered a baseline procedure that results in the most equal distribution of exposure. In industry settings, recommender systems often operate via a two-step process in which candidate items are first produced using computationally inexpensive methods and then a full ranking model is applied only to those candidates.
In this paper, we consider the effects of inequality at the first step and show that, paradoxically, complete randomization at the second step can result in a higher degree of inequality relative to deterministic ordering of items by estimated relevance scores. In light of this observation, we then propose a simple post-processing algorithm in pursuit of reducing exposure inequality that works both when candidate sets have a high level of imbalance and when they do not. The efficacy of our method is illustrated on both simulated data and a common benchmark data set used in studying fairness in recommender systems.
Submitted 11 September, 2022;
originally announced September 2022.
-
Measuring Disparate Outcomes of Content Recommendation Algorithms with Distributional Inequality Metrics
Authors:
Tomo Lazovich,
Luca Belli,
Aaron Gonzales,
Amanda Bower,
Uthaipon Tantipongpipat,
Kristian Lum,
Ferenc Huszar,
Rumman Chowdhury
Abstract:
The harmful impacts of algorithmic decision systems have recently come into focus, with many examples of systems such as machine learning (ML) models amplifying existing societal biases. Most metrics attempting to quantify disparities resulting from ML algorithms focus on differences between groups, dividing users based on demographic identities and comparing model performance or overall outcomes between these groups. However, in industry settings, such information is often not available, and inferring these characteristics carries its own risks and biases. Moreover, typical metrics that focus on a single classifier's output ignore the complex network of systems that produce outcomes in real-world settings. In this paper, we evaluate a set of metrics originating from economics, distributional inequality metrics, and their ability to measure disparities in content exposure in a production recommendation system, the Twitter algorithmic timeline. We define desirable criteria for metrics to be used in an operational setting, specifically by ML practitioners. We characterize different types of engagement with content on Twitter using these metrics, and use these results to evaluate the metrics with respect to the desired criteria. We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users. Overall, we conclude that these metrics can be useful tools for understanding disparate outcomes in online social networks.
Submitted 3 February, 2022;
originally announced February 2022.
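The abstract above does not say which distributional inequality metrics were evaluated. As a rough illustration of the idea only, the sketch below computes two standard metrics from economics, the Gini coefficient and a top-share ratio, over a hypothetical list of per-author impression counts. The function names and the sample data are assumptions made for this example and are not taken from the paper.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means every item has the same value; values close to 1.0 mean
    the total is concentrated in very few items.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending-sorted data, with ranks i = 1..n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of items."""
    xs = sorted(values, reverse=True)
    k = max(1, int(round(fraction * len(xs))))
    total = sum(xs)
    return sum(xs[:k]) / total if total else 0.0


# Hypothetical impression counts for four authors (illustrative only).
equal_exposure = [1, 1, 1, 1]
skewed_exposure = [0, 0, 0, 10]

print(gini(equal_exposure), gini(skewed_exposure))
print(top_share(skewed_exposure, 0.25))
```

Neither metric needs demographic labels: both operate on the distribution of an outcome across accounts, which is the property the abstract highlights for industry settings.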
-
Algorithmic Amplification of Politics on Twitter
Authors:
Ferenc Huszár,
Sofia Ira Ktena,
Conor O'Brien,
Luca Belli,
Andrew Schlaikjer,
Moritz Hardt
Abstract:
Content on Twitter's home timeline is selected and ordered by personalization algorithms. By consistently ranking certain content higher, these algorithms may amplify some messages while reducing the visibility of others. There's been intense public and scholarly debate about the possibility that some political groups benefit more from algorithmic amplification than others. We provide quantitative evidence from a long-running, massive-scale randomized experiment on the Twitter platform that committed a randomized control group including nearly 2M daily active accounts to a reverse-chronological content feed free of algorithmic personalization. We present two sets of findings. First, we studied Tweets by elected legislators from major political parties in 7 countries. Our results reveal a remarkably consistent trend: In 6 out of 7 countries studied, the mainstream political right enjoys higher algorithmic amplification than the mainstream political left. Consistent with this overall trend, our second set of findings studying the U.S. media landscape revealed that algorithmic amplification favours right-leaning news sources. We further looked at whether algorithms amplify far-left and far-right political groups more than moderate ones: contrary to prevailing public belief, we did not find evidence to support this hypothesis. We hope our findings will contribute to an evidence-based debate on the role personalization algorithms play in shaping political content consumption.
Submitted 21 October, 2021;
originally announced October 2021.
-
The 2021 RecSys Challenge Dataset: Fairness is not optional
Authors:
Luca Belli,
Alykhan Tejani,
Frank Portman,
Alexandre Lung-Yut-Fong,
Ben Chamberlain,
Yuanpu Xie,
Kristian Lum,
Jonathan Hunt,
Michael Bronstein,
Vito Walter Anelli,
Saikishore Kalloori,
Bruce Ferwerda,
Wenzhe Shi
Abstract:
After the success of the RecSys 2020 Challenge, we describe a novel and bigger dataset that was released in conjunction with the ACM RecSys Challenge 2021. This year's dataset is not only bigger (~1B data points, a 5-fold increase), but for the first time it takes into consideration fairness aspects of the challenge. Unlike many static datasets, a lot of effort went into making sure that the dataset was synced with the Twitter platform: if a user deleted their content, the same content would be promptly removed from the dataset too. In this paper, we introduce the dataset and challenge, highlighting some of the issues that arise when creating recommender systems at Twitter scale.
Submitted 21 September, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
Causal Inference Struggles with Agency on Online Platforms
Authors:
Smitha Milli,
Luca Belli,
Moritz Hardt
Abstract:
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a "Catch-22" that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.
Submitted 10 May, 2022; v1 submitted 19 July, 2021;
originally announced July 2021.
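Two of the estimators named in the abstract above, the naive difference in group means and inverse probability of treatment weighting (IPW), can be sketched on a toy dataset. This is a minimal illustration, not the paper's implementation: the data, the variable names, and the assumption that the true propensity scores are known are all invented for the example.

```python
def diff_in_means(treated, outcome):
    """Naive estimator: mean outcome of the treated minus mean outcome of controls."""
    t = [y for d, y in zip(treated, outcome) if d == 1]
    c = [y for d, y in zip(treated, outcome) if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)


def ipw_ate(treated, outcome, propensity):
    """IPW (Horvitz-Thompson style) estimate of the average treatment effect.

    Each unit is weighted by the inverse of the probability of the
    treatment status it actually received.
    """
    n = len(outcome)
    total = 0.0
    for d, y, p in zip(treated, outcome, propensity):
        total += d * y / p - (1 - d) * y / (1 - p)
    return total / n


# Toy data (hypothetical): outcome = 2 * x + 1 * d, so the true effect is 1.
# Units with x = 1 self-select into treatment with probability 0.75,
# units with x = 0 with probability 0.25.
x          = [1, 1, 1, 1, 0, 0, 0, 0]
treated    = [1, 1, 1, 0, 1, 0, 0, 0]
outcome    = [2 * xi + di for xi, di in zip(x, treated)]
propensity = [0.75 if xi == 1 else 0.25 for xi in x]

naive_estimate = diff_in_means(treated, outcome)
ipw_estimate = ipw_ate(treated, outcome, propensity)
print(naive_estimate, ipw_estimate)
```

In this toy setting the confounder `x` is fully observed and the propensities are exact, so IPW recovers the true effect while the naive estimator is biased upward. The abstract's finding is that on real self-selected platform data the observational estimates did not recover the experimental results, which suggests this idealized condition is hard to meet in practice.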
-
From Optimizing Engagement to Measuring Value
Authors:
Smitha Milli,
Luca Belli,
Moritz Hardt
Abstract:
Most recommendation engines today are based on predicting user engagement, e.g. predicting whether a user will click on an item or not. However, there is potentially a large gap between engagement signals and a desired notion of "value" that is worth optimizing for. We use the framework of measurement theory to (a) confront the designer with a normative question about what the designer values, (b) provide a general latent variable model approach that can be used to operationalize the target construct and directly optimize for it, and (c) guide the designer in evaluating and revising their operationalization. We implement our approach on the Twitter platform on millions of users. In line with established approaches to assessing the validity of measurements, we perform a qualitative evaluation of how well our model captures a desired notion of "value".
Submitted 19 July, 2021; v1 submitted 20 August, 2020;
originally announced August 2020.
-
Assessing Demographic Bias in Named Entity Recognition
Authors:
Shubhanshu Mishra,
Sijun He,
Luca Belli
Abstract:
Named Entity Recognition (NER) is often the first step towards automated Knowledge Base (KB) generation from raw text. In this work, we assess the bias in various Named Entity Recognition (NER) systems for English across different demographic groups with synthetically generated corpora. Our analysis reveals that models perform better at identifying names from specific demographic groups across two datasets. We also identify that debiased embeddings do not help in resolving this issue. Finally, we observe that character-based contextualized word representation models such as ELMo result in the least bias across demographics. Our work can shed light on potential biases in automated KB generation due to systematic exclusion of named entities belonging to certain demographics.
Submitted 7 August, 2020;
originally announced August 2020.
-
Privacy-Aware Recommender Systems Challenge on Twitter's Home Timeline
Authors:
Luca Belli,
Sofia Ira Ktena,
Alykhan Tejani,
Alexandre Lung-Yut-Fong,
Frank Portman,
Xiao Zhu,
Yuanpu Xie,
Akshay Gupta,
Michael Bronstein,
Amra Delić,
Gabriele Sottocornola,
Walter Anelli,
Nazareno Andrade,
Jessie Smith,
Wenzhe Shi
Abstract:
Recommender systems constitute the core engine of most social network platforms nowadays, aiming to maximize user satisfaction along with other key business objectives. Twitter is no exception. Despite the fact that Twitter data has been extensively used to understand socioeconomic and political phenomena and user behaviour, the implicit feedback provided by users on Tweets through their engagements on the Home Timeline has only been explored to a limited extent. At the same time, there is a lack of large-scale public social network datasets that would enable the scientific community to both benchmark and build more powerful and comprehensive models that tailor content to user interests. By releasing an original dataset of 160 million Tweets along with engagement information, Twitter aims to address exactly that. During this release, special attention is paid to maintaining compliance with existing privacy laws. Apart from user privacy, this paper touches on the key challenges faced by researchers and professionals striving to predict user engagements. It further describes the key aspects of the RecSys 2020 Challenge that was organized by ACM RecSys in partnership with Twitter using this dataset.
Submitted 7 October, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
Fighting Redundancy and Model Decay with Embeddings
Authors:
Dan Shiebler,
Luca Belli,
Jay Baxter,
Hanchen Xiong,
Abhishek Tayal
Abstract:
Every day, hundreds of millions of new Tweets containing over 40 languages of ever-shifting vernacular flow through Twitter. Models that attempt to extract insight from this firehose of information must face the torrential covariate shift that is endemic to the Twitter platform. While regularly-retrained algorithms can maintain performance in the face of this shift, fixed model features that fail to represent new trends and tokens can quickly become stale, resulting in performance degradation. To mitigate this problem we employ learned features, or embedding models, that can efficiently represent the most relevant aspects of a data distribution. Sharing these embedding models across teams can also reduce redundancy and multiplicatively increase cross-team modeling productivity. In this paper, we detail the commoditized tools, algorithms and pipelines that we have developed and are developing at Twitter to regularly generate high quality, up-to-date embeddings and share them broadly across the company.
Submitted 18 September, 2018;
originally announced September 2018.