Search | arXiv e-print repository

The Misuse of AUC: What High Impact Risk Assessment Gets Wrong

Authors: Kweku Kwegyir-Aggrey, Marissa Gerchick, Malika Mohan, Aaron Horowitz, Suresh Venkatasubramanian

Abstract: When determining which machine learning model best performs some high impact risk assessment task, practitioners commonly use the Area under the Curve (AUC) to defend and validate their model choices. In this paper, we argue that the current use and understanding of AUC as a model performance metric misunderstands the way the metric was intended to be used. To this end, we characterize the misuse… ▽ More When determining which machine learning model best performs some high impact risk assessment task, practitioners commonly use the Area under the Curve (AUC) to defend and validate their model choices. In this paper, we argue that the current use and understanding of AUC as a model performance metric misunderstands the way the metric was intended to be used. To this end, we characterize the misuse of AUC and illustrate how this misuse negatively manifests in the real world across several risk assessment domains. We locate this disconnect in the way the original interpretation of AUC has shifted over time to the point where issues pertaining to decision thresholds, class balance, statistical uncertainty, and protected groups remain unaddressed by AUC-based model comparisons, and where model choices that should be the purview of policymakers are hidden behind the veil of mathematical rigor. We conclude that current model validation practices involving AUC are not robust, and often invalid. △ Less

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2203.07490 [pdf, other]

Repairing Regressors for Fair Binary Classification at Any Decision Threshold

Authors: Kweku Kwegyir-Aggrey, A. Feder Cooper, Jessica Dai, John Dickerson, Keegan Hines, Suresh Venkatasubramanian

Abstract: We study the problem of post-processing a supervised machine-learned regressor to maximize fair binary classification at all decision thresholds. By decreasing the statistical distance between each group's score distributions, we show that we can increase fair performance across all thresholds at once, and that we can do so without a large decrease in accuracy. To this end, we introduce a formal m… ▽ More We study the problem of post-processing a supervised machine-learned regressor to maximize fair binary classification at all decision thresholds. By decreasing the statistical distance between each group's score distributions, we show that we can increase fair performance across all thresholds at once, and that we can do so without a large decrease in accuracy. To this end, we introduce a formal measure of Distributional Parity, which captures the degree of similarity in the distributions of classifications for different protected groups. Our main result is to put forward a novel post-processing algorithm based on optimal transport, which provably maximizes Distributional Parity, thereby attaining common notions of group fairness like Equalized Odds or Equal Opportunity at all thresholds. We demonstrate on two fairness benchmarks that our technique works well empirically, while also outperforming and generalizing similar techniques from related work. △ Less

Submitted 10 December, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

arXiv:2104.00606 [pdf, other]

Model Selection's Disparate Impact in Real-World Deep Learning Applications

Authors: Jessica Zosa Forde, A. Feder Cooper, Kweku Kwegyir-Aggrey, Chris De Sa, Michael Littman

Abstract: Algorithmic fairness has emphasized the role of biased data in automated decision outcomes. Recently, there has been a shift in attention to sources of bias that implicate fairness in other stages in the ML pipeline. We contend that one source of such bias, human preferences in model selection, remains under-explored in terms of its role in disparate impact across demographic groups. Using a deep… ▽ More Algorithmic fairness has emphasized the role of biased data in automated decision outcomes. Recently, there has been a shift in attention to sources of bias that implicate fairness in other stages in the ML pipeline. We contend that one source of such bias, human preferences in model selection, remains under-explored in terms of its role in disparate impact across demographic groups. Using a deep learning model trained on real-world medical imaging data, we verify our claim empirically and argue that choice of metric for model comparison, especially those that do not take variability into account, can significantly bias model selection outcomes. △ Less

Submitted 7 September, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: Science and Engineering of Deep Learning Workshop, ICLR 2021

arXiv:2102.10349 [pdf, other]

Everything is Relative: Understanding Fairness with Optimal Transport

Authors: Kweku Kwegyir-Aggrey, Rebecca Santorella, Sarah M. Brown

Abstract: To study discrimination in automated decision-making systems, scholars have proposed several definitions of fairness, each expressing a different fair ideal. These definitions require practitioners to make complex decisions regarding which notion to employ and are often difficult to use in practice since they make a binary judgement a system is fair or unfair instead of explaining the structure of… ▽ More To study discrimination in automated decision-making systems, scholars have proposed several definitions of fairness, each expressing a different fair ideal. These definitions require practitioners to make complex decisions regarding which notion to employ and are often difficult to use in practice since they make a binary judgement a system is fair or unfair instead of explaining the structure of the detected unfairness. We present an optimal transport-based approach to fairness that offers an interpretable and quantifiable exploration of bias and its structure by comparing a pair of outcomes to one another. In this work, we use the optimal transport map to examine individual, subgroup, and group fairness. Our framework is able to recover well known examples of algorithmic discrimination, detect unfairness when other metrics fail, and explore recourse opportunities. △ Less

Submitted 20 February, 2021; originally announced February 2021.

Showing 1–4 of 4 results for author: Kwegyir-Aggrey, K