Flexible Fairness Learning via Inverse Conditional Permutation
Abstract
Equalized odds, as a popular notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm prediction when conditioning on the true outcome. Despite rapid advancements, most of the current research focuses on the violation of equalized odds caused by one sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes under-addressed. We address this gap by introducing a fairness learning approach that integrates adversarial learning with a novel inverse conditional permutation. This approach effectively and flexibly handles multiple sensitive attributes, potentially of mixed data types. The efficacy and flexibility of our method are demonstrated through both simulation studies and empirical analysis of real-world datasets.
1 Introduction
Machine learning models have become important tools for aiding decision-making in various applications. One of the challenges in applying machine learning is ensuring that the models are fair, i.e., they do not discriminate against minorities or other protected groups (Mehrabi et al., 2021). Several fairness concepts have been developed in the literature to address different practical needs (Mehrabi et al., 2021; Castelnovo et al., 2022). In this work, we consider the equalized odds criterion (Hardt et al., 2016), defined as
$$\hat{Y} \perp\!\!\!\perp A \mid Y. \tag{1}$$
Here, $Y$ is the response variable, $A$ is the sensitive attribute(s) that we want to protect (e.g., gender, race, or income), and $\hat{Y}$ is the prediction given by any model. Notice that, when the conditioning on $Y$ is dropped, (1) becomes the unconditional independence $\hat{Y} \perp\!\!\!\perp A$, which corresponds to demographic parity. The requirement that a learned model satisfy certain independence relations is not limited to the realm of fairness: in a broader context, statisticians have long worked on robust inference techniques based on the concept of a pivot, a quantity whose distribution is invariant with respect to the nuisance parameters (see, e.g., Keener (2010)).
Despite the exciting progress, most existing algorithms aiming for equalized odds can only handle one protected attribute. However, in fields including clinical research, there is a growing need to mitigate biases related to multiple sensitive attributes (Yang et al., 2022). It has also been pointed out that fairness gerrymandering can occur when algorithmic decision-making considers only a single sensitive attribute at a time (Kearns et al., 2018). Additionally, the equalized-odds problem in the context of continuous sensitive attributes is much less explored.
Here, we alleviate these two limitations by proposing a versatile equalized odds training scheme, FairICP, illustrated in Figure 1. Building on the sensitive attribute resampling framework (Romano et al., 2020), we use a novel inverse conditional permutation (ICP) strategy to generate $\tilde{A}$, conditional permutations of $A$ given $Y$, and construct a fairer model by regularizing the distribution of $(\hat{Y}, A, Y)$ toward the distribution of $(\hat{Y}, \tilde{A}, Y)$. Our contributions are summarized below.
- We propose a novel inverse conditional permutation (ICP) strategy to generate $\tilde{A}$, conditional permutations of $A$, without estimating the multi-dimensional conditional density of $A \mid Y$.
- We show theoretically that the equalized odds condition holds asymptotically for $(\hat{Y}, \tilde{A}, Y)$ when $\tilde{A}$ is generated according to ICP.
- We propose examining the fairness level with a recently developed non-parametric conditional dependence measure.
- We demonstrate experimentally that FairICP enjoys improved efficacy and flexibility.
Related work
Existing fairness concepts can be divided into different categories, including statistical/group fairness (Hardt et al., 2016; Zafar et al., 2017), which aims to ensure similar predictions across different groups; individual fairness (Dwork et al., 2012), which targets similar predictions for similar individuals; and causality-based fairness (Kusner et al., 2017), which tries to reveal causal relationships. More comprehensive discussions can be found in (Mehrabi et al., 2021; Castelnovo et al., 2022). Prominent statistical fairness measures include demographic parity (Zafar et al., 2019), equal opportunity (Hardt et al., 2016), and equalized odds (Hardt et al., 2016), which can all be articulated as (conditional) independence relations from a statistical perspective. Given the fairness concept, the associated procedures can be generally categorized into three types: (1) pre-processing, (2) post-processing, and (3) in-processing. Pre-processing aims to correct potentially biased data before any model fitting procedures (Zemel et al., 2013; Feldman et al., 2015), while post-processing modifies the classifier’s output at the test phase, leaving the model unchanged (Hardt et al., 2016; Kim et al., 2018; Hebert-Johnson et al., 2018).
FairICP is an in-processing method that encourages equalized-odds fairness for multiple complex sensitive attributes during model training. Several in-processing methods have previously been introduced to address violations of equalized odds. For example, Agarwal et al. (2018) describe a procedure for handling categorical sensitive attributes in binary classification. Mary et al. (2019) train a model that penalizes the violation of equalized odds, measured by the Hirschfeld-Gebelein-Rényi (HGR) Maximum Correlation Coefficient; the method is designed to reduce equalized-odds violations in the presence of one sensitive attribute, whether categorical or continuous. Closely related to FairICP, another line of in-processing algorithms encourages fairness using an adversarial loss designed for different fairness metrics (Zhang et al., 2018). In particular, Romano et al. (2020) propose an adversarial learning loss that utilizes a resampled synthetic variable $\tilde{A}$ drawn from the conditional distribution of a potentially continuous $A$ given $Y$. Although the joint consideration of multiple sensitive attributes has been explored for demographic parity under this framework (Creager et al., 2019), jointly modeling multiple sensitive attributes, especially continuous ones, remains an unresolved challenge for equalized odds. This challenge is largely due to the difficulty of estimating the conditional density of $A \mid Y$. Our approach shares a similar loss design with Romano et al. (2020) but employs a novel permutation technique capable of handling multiple and complex protected variables.
2 Method
We propose a general adversarial learning procedure to obtain models with improved equalized odds guarantees by utilizing a novel Inverse Conditional Permutation (ICP). The proposed procedure, FairICP, enables efficient fairness learning with multi-dimensional sensitive attributes and either categorical or continuous response $Y$. Before describing our proposal, we first define some notation used throughout this paper. We also review the framework of model training with an equalized odds penalty based on sensitive attribute re-sampling (Romano et al., 2020), and the challenge in applying sensitive attribute re-sampling and existing methods to multi-dimensional attributes, which motivates our proposal.
Let $(X_i, A_i, Y_i)$ for $i = 1, \ldots, n$ be i.i.d. generated triples of (feature, sensitive attribute, response). Let $f_{\theta_f}(\cdot)$ be a prediction function with model parameter $\theta_f$. Although $f_{\theta_f}$ can be any prediction function that is differentiable in $\theta_f$, we take $f_{\theta_f}$ to be a neural network throughout this work. Let $\hat{Y} = f_{\theta_f}(X)$ be the prediction for $Y$ given $X$. For a regression problem, $\hat{Y}$ is the predicted value of the continuous response $Y$; for a classification problem, the last layer of $f_{\theta_f}$ is a softmax layer and $\hat{Y}$ is the predicted probability vector for $Y$ being in each class. We also denote $\mathbf{X} = (X_1, \ldots, X_n)$, $\mathbf{A} = (A_1, \ldots, A_n)$, and $\mathbf{Y} = (Y_1, \ldots, Y_n)$.
2.1 Fairness-learning via sensitive attribute re-sampling
We first present the framework (Romano et al., 2020) denoted by Fair Dummies Learning (FDL) to achieve equalized odds for one sensitive attribute. Our terminology will differ somewhat from the terminology used in this reference, to help us introduce the new perspectives and frameworks in this paper later on.
To evaluate the potential violation of equalized odds (1) in the prediction $\hat{Y}$, FDL constructs a resampled version $\tilde{A}$ of the original sensitive attribute to serve as a contrast, sampled according to $\tilde{A}_i \sim Q(\cdot \mid Y_i)$, where $i = 1, \ldots, n$ and $Q(\cdot \mid y)$ denotes the conditional distribution of $A$ given $Y = y$. Since we generate $\tilde{A}$ without looking at $\hat{Y}$, the following equalized odds property holds: $\hat{Y} \perp\!\!\!\perp \tilde{A} \mid Y$. Hence, we can measure the degree of violation of the equalized odds condition by measuring the discrepancy between the distribution of $(\hat{Y}, A, Y)$ and the distribution of $(\hat{Y}, \tilde{A}, Y)$. Following this intuition, FDL utilizes a GAN (Goodfellow et al., 2014) to iteratively learn how to separate the two distributions and to optimize a fairness-regularized prediction loss. More specifically, define
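$$\mathcal{L}_f(\theta_f) = \mathbb{E}_{XY}\left[-\log p_{\theta_f}(Y \mid X)\right], \tag{2}$$
$$\mathcal{L}_d(\theta_f, \theta_d) = \mathbb{E}_{\hat{Y}AY}\left[-\log D_{\theta_d}(\hat{Y}, A, Y)\right] + \mathbb{E}_{\hat{Y}\tilde{A}Y}\left[-\log\left(1 - D_{\theta_d}(\hat{Y}, \tilde{A}, Y)\right)\right], \tag{3}$$
$$V_\mu(\theta_f, \theta_d) = (1 - \mu)\,\mathcal{L}_f(\theta_f) - \mu\,\mathcal{L}_d(\theta_f, \theta_d), \tag{4}$$
as the expected negative log-likelihood loss, the discriminator loss, and the value function, respectively, where $D_{\theta_d}(\cdot)$ is the classifier that separates $(\hat{Y}, A, Y)$ from $(\hat{Y}, \tilde{A}, Y)$, and $\mu \in [0, 1]$ is a tuning parameter that controls the prediction-fairness trade-off. Then, FDL learns $\theta_f$ by finding the minimax solution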
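$$\hat{\theta}_f, \hat{\theta}_d = \arg\min_{\theta_f} \max_{\theta_d} V_\mu(\theta_f, \theta_d). \tag{5}$$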
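The interplay between the discriminator loss and the value function can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the value function takes the convex-combination form $(1-\mu)\,\mathcal{L}_f - \mu\,\mathcal{L}_d$ with $\mu \in [0, 1]$, and the function names `discriminator_loss` and `value_function` are hypothetical. It shows that a discriminator that cannot tell real from synthetic triples (output 1/2 everywhere) attains a loss of $\log 4$.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Empirical discriminator loss.

    d_real: discriminator outputs in (0, 1) on the observed triples (Y_hat, A, Y)
    d_fake: discriminator outputs in (0, 1) on the synthetic triples (Y_hat, A_tilde, Y)
    """
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def value_function(pred_loss, disc_loss, mu):
    """Assumed form of the value function: (1 - mu) * L_f - mu * L_d."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu is assumed to lie in [0, 1]")
    return (1.0 - mu) * pred_loss - mu * disc_loss


# A discriminator that outputs 1/2 everywhere cannot separate the two samples.
uninformative = discriminator_loss([0.5] * 4, [0.5] * 4)
# A discriminator that separates the samples well has a smaller loss.
informative = discriminator_loss([0.9] * 4, [0.1] * 4)
```

Under this assumed form, the predictor minimizes the value function while the discriminator maximizes it, so the predictor is rewarded when the best achievable discriminator loss stays close to $\log 4$.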
FDL generates $\tilde{A}$ through Conditional Randomization (CR) (Candès et al., 2018), i.e., by re-sampling it from its (estimated) conditional distribution given the other variables that we want to control for. However, the effectiveness of conditional randomization requires estimation of the conditional distribution of $A$ given $Y$, which is challenging when $A$ is multi-dimensional (Scott, 1991). This challenge is not unique to FDL and needs to be addressed for other non-resampling-based approaches such as the Holdout Randomization Test (HRT) (Tansey et al., 2022) as well. In addition, the sensitive attributes can potentially be both discrete and continuous, which adds another layer of difficulty to estimating the conditional distribution of $A$ given $Y$. An approach that allows $A$ to have flexible types and scales well with the dimension of $A$ would help promote fairness learning in many social and medical applications.
2.2 Fairness learning via ICP
To circumvent the challenge of learning the conditional density of $A$ given $Y$, we pivot to estimating $Y$ given $A$ and leverage Conditional Permutation (CP) (Berrett et al., 2020) to generate a permuted version $\tilde{\mathbf{A}}$ of $\mathbf{A}$, which also satisfies the equalized odds property (1) asymptotically.
CP in fairness learning.
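We first describe how the vanilla CP strategy of Berrett et al. (2020) generates permutation copies in our setting.
Let $\mathcal{S}_n$ denote the set of permutations on the indices $\{1, \ldots, n\}$. Given any vector $\mathbf{x} = (x_1, \ldots, x_n)$ and any permutation $\pi \in \mathcal{S}_n$, define $\mathbf{x}_\pi = (x_{\pi(1)}, \ldots, x_{\pi(n)})$ as the permuted version of $\mathbf{x}$ with its entries reordered according to the permutation $\pi$. Instead of drawing a permutation $\Pi$ uniformly at random, CP assigns unequal sampling probabilities to permutations based on the conditional probability of observing $\mathbf{A}_\pi$ given $\mathbf{Y}$:
$$\mathbb{P}\left\{\Pi = \pi \mid \mathbf{A}, \mathbf{Y}\right\} = \frac{q^n(\mathbf{A}_\pi \mid \mathbf{Y})}{\sum_{\pi' \in \mathcal{S}_n} q^n(\mathbf{A}_{\pi'} \mid \mathbf{Y})}. \tag{6}$$
Here we let $q(\cdot \mid y)$ be the density of the distribution $Q(\cdot \mid y)$ (i.e., $q(\cdot \mid y)$ is the conditional density of $A$ given $Y = y$). We write $q^n(\cdot \mid \mathbf{Y}) := q(\cdot \mid Y_1) \cdots q(\cdot \mid Y_n)$ to denote the product density. This leads to the synthetic $\tilde{\mathbf{A}} = \mathbf{A}_\Pi$, which, intuitively, should have low dependence on $\hat{\mathbf{Y}}$ given $\mathbf{Y}$, and can thus be utilized to encourage equalized odds as described in (1).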
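ICP circumvents density estimation of $A \mid Y$.
Unfortunately, conducting conditional permutation with multivariate $A$ relies on conditional density estimation of $A$ given $Y$ and does not alleviate the issue arising from multivariate density estimation mentioned earlier. To circumvent this problem, we propose a simple ICP (inverse conditional permutation) strategy, which is indirect yet scales better with the dimensionality of $A$ and can adapt easily to various data types of $A$.
ICP begins with the observation that the distribution of $(\mathbf{A}_\Pi, \mathbf{Y})$ is identical to the distribution of $(\mathbf{A}, \mathbf{Y}_{\Pi^{-1}})$. Hence, intuitively, instead of determining $\Pi$ based on the conditional law of $\mathbf{A}$ given $\mathbf{Y}$, we first consider the conditional permutation of $\mathbf{Y}$ given $\mathbf{A}$, which is one dimensional and can be estimated conveniently using standard regression or generalized regression techniques regardless of the complexity in $A$. We then generate $\Pi$ by applying an inverse operator to the distribution of these permutations. Specifically, we generate $\tilde{\mathbf{A}} = \mathbf{A}_\Pi$ with the following probabilities:
$$\mathbb{P}\left\{\Pi = \pi \mid \mathbf{A}, \mathbf{Y}\right\} = \frac{q_{Y \mid A}^n(\mathbf{Y}_{\pi^{-1}} \mid \mathbf{A})}{\sum_{\pi' \in \mathcal{S}_n} q_{Y \mid A}^n(\mathbf{Y}_{\pi'^{-1}} \mid \mathbf{A})}, \tag{7}$$
where $q_{Y \mid A}$ denotes the conditional density of $Y$ given $A$ and $q_{Y \mid A}^n(\cdot \mid \mathbf{A})$ the corresponding product density.
Indeed, this intuition yields a synthetic $\tilde{\mathbf{A}}$ that can be used to monitor violations of the equalized odds condition, as formalized in the following theorem.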
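For very small $n$, the ICP sampling probabilities can be computed exactly by enumerating all permutations, which makes the role of the inverse permutation concrete. The sketch below is illustrative only: it assumes a hypothetical Gaussian working model for $Y$ given $A$ (in practice this density would be estimated), and the helper names `gaussian_log_density`, `icp_distribution`, and `apply_permutation` are not from the paper.

```python
import itertools
import math
import random


def gaussian_log_density(y, a_row, sd=1.0):
    """Hypothetical working model for Y | A: Y ~ N(sum of the entries of A, sd^2)."""
    mean = sum(a_row)
    return -0.5 * ((y - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))


def icp_distribution(a_rows, y, log_q=gaussian_log_density):
    """Enumerate all permutations pi with weight prod_i q(Y_{pi^{-1}(i)} | A_i)."""
    n = len(y)
    perms = list(itertools.permutations(range(n)))
    log_weights = []
    for pi in perms:
        inverse = [0] * n
        for position, target in enumerate(pi):
            inverse[target] = position  # inverse is pi^{-1}
        log_weights.append(sum(log_q(y[inverse[i]], a_rows[i]) for i in range(n)))
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_weights]
    total = sum(weights)
    return perms, [w / total for w in weights]


def apply_permutation(a_rows, pi):
    """Return A_pi, whose j-th row is row pi(j) of A."""
    return [a_rows[p] for p in pi]


# Two-dimensional sensitive attributes and a response that tracks their sum.
a_rows = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
y = [0.1, 1.9, 4.2]
perms, probs = icp_distribution(a_rows, y)

rng = random.Random(0)
pi = rng.choices(perms, weights=probs, k=1)[0]
a_tilde = apply_permutation(a_rows, pi)
```

Only a one-dimensional density of $Y$ given $A$ is evaluated here, no matter how many columns $A$ has. Enumeration costs $n!$ evaluations, so it is useful only as a check on tiny examples.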
Theorem 2.1.
For any observations , let be generated by the ICP sampling scheme (7). Let denote the unordered set of rows in , and let be the dimension of . We have
(1) If , then .
(2) If , then . Further, when , the asymptotic equalized odds condition holds: for any constant vectors and ,
Remark 2.2.
In FDL, the availability of an accurate conditional density enables the equivalence between $\hat{Y} \perp\!\!\!\perp A \mid Y$ and $(\hat{Y}, \tilde{A}, Y) \stackrel{d}{=} (\hat{Y}, A, Y)$. ICP pays an almost negligible price: it offers a fast-rate asymptotic equivalence while circumventing the density estimation of $A \mid Y$.
Motivated by this, we propose an adversarial learning procedure utilizing the permuted sensitive attributes from the ICP sampling scheme (7), built on the same formulation of the loss function as in Section 2.1. Let $\hat{\mathcal{L}}_f$ and $\hat{\mathcal{L}}_d$ be the empirical realizations of the losses $\mathcal{L}_f$ and $\mathcal{L}_d$ defined in (2) and (3), respectively. Algorithm 1 presents the details. For completeness, we detail the permutation sampling algorithm, the parallelized pairwise sampler adapted from Berrett et al. (2020), in Appendix B.
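A pairwise sampler of the kind described by Berrett et al. (2020) can be adapted to ICP along the following lines. This is a simplified sketch rather than the Appendix B algorithm: it assumes a scalar sensitive attribute, a user-supplied log-density `log_q(y, a)` for $Y$ given $A$ (here a hypothetical Gaussian), and a fixed number of sweeps; the name `icp_pairwise_sampler` is illustrative.

```python
import math
import random


def log_q_gaussian(y, a, sd=1.0):
    """Hypothetical log-density of Y | A = a, up to an additive constant."""
    return -0.5 * ((y - a) / sd) ** 2


def icp_pairwise_sampler(a, y, log_q, n_steps=50, seed=0):
    """Permute Y given A by random pairwise swaps, then invert to obtain A_tilde."""
    rng = random.Random(seed)
    n = len(y)
    sigma = list(range(n))  # A_i is currently paired with Y_{sigma[i]}
    for _ in range(n_steps):
        order = list(range(n))
        rng.shuffle(order)  # random disjoint pairs (order[0], order[1]), ...
        for k in range(0, n - 1, 2):
            i, j = order[k], order[k + 1]
            keep = log_q(y[sigma[i]], a[i]) + log_q(y[sigma[j]], a[j])
            swap = log_q(y[sigma[j]], a[i]) + log_q(y[sigma[i]], a[j])
            diff = keep - swap
            # P(swap) = exp(swap) / (exp(keep) + exp(swap)), guarded against overflow
            p_swap = 0.0 if diff > 700.0 else 1.0 / (1.0 + math.exp(diff))
            if rng.random() < p_swap:
                sigma[i], sigma[j] = sigma[j], sigma[i]
    # Inverse step: the attribute paired with Y_{sigma[i]} becomes its synthetic value.
    a_tilde = [None] * n
    for i in range(n):
        a_tilde[sigma[i]] = a[i]
    return a_tilde


a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.3, 0.8, 2.4, 2.9, 4.1, 5.2]
a_tilde = icp_pairwise_sampler(a, y, log_q_gaussian)
```

The swaps act on the pairing of $Y$ with $A$, and only the final step maps the result back to a permutation of $A$. When $Y$ determines $A$ almost exactly, swaps are almost never accepted and the synthetic attribute stays close to the original.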
Theorem 2.3.
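If there exists a minimax solution $(\hat{\theta}_f, \hat{\theta}_d)$ for $V_\mu(\cdot, \cdot)$ defined in (5) such that $V_\mu(\hat{\theta}_f, \hat{\theta}_d) = (1 - \mu) H(Y \mid X) - \mu \log 4$, where $H(Y \mid X) = \mathbb{E}_{XY}\left[-\log p(Y \mid X)\right]$ denotes the conditional entropy, then $f_{\hat{\theta}_f}(\cdot)$ is both an optimal and fair predictor, which simultaneously minimizes $\mathcal{L}_f$ and satisfies equalized odds.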
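Algorithm 1.
Input: Data $(\mathbf{X}, \mathbf{A}, \mathbf{Y}) = \{(X_i, A_i, Y_i)\}_{i=1}^{n}$.
Parameters: penalty weight $\mu$, step size, number of gradient steps, and number of iterations.
Output: predictive model $f_{\hat{\theta}_f}$ and discriminator $D_{\hat{\theta}_d}$.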
In practice, the assumption that an optimal and fair predictor (in terms of equalized odds) exists may not hold (Tang and Zhang, 2022). Setting $\mu$ to a large value will preferentially force $f$ to satisfy equalized odds, while setting $\mu$ close to 0 will push $f$ toward optimal prediction: an increase in accuracy is often accompanied by a decrease in fairness, and vice versa.
2.3 Density Estimation
The estimation of conditional densities is a crucial part of both our method and previous work (Romano et al., 2020; Mary et al., 2019). However, unlike previous work, which requires the estimation of $A \mid Y$, our proposal looks into the inverse relationship $Y \mid A$. In practice, our proposed method can easily leverage state-of-the-art density estimators and is less affected by increased complexity in $A$, due to either dimension or data types.
In this manuscript, we apply Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) to estimate the conditional density of $Y \mid A$ when $Y$ is continuous, where $A$ can take arbitrary data types (discrete or continuous). (In the MAF paper (Papamakarios et al., 2017), to estimate $p(Y \mid A)$, $Y$ is assumed to be continuous while $A$ can take arbitrary form, and there is no requirement on the dimensionality of $Y$ and $A$.) In the classification scenario, where $Y$ is categorical, one can always fit a classifier to model $Y \mid A$. As a result, FairICP can more feasibly handle complex sensitive attributes and is suitable for both regression and classification tasks. To provide more theoretical and empirical insight into how the quality of density estimation affects CP and ICP, we include additional analysis in Appendix C.
3 Measuring the violation of equalized odds
To gain a reliable understanding of the potential violation of equalized odds by the trained model $f_{\hat{\theta}_f}$, we carry out a disciplined evaluation utilizing an untouched test set and a recently proposed conditional independence measure.
3.1 Measure of Conditional Dependence
From a statistical point of view, equalized odds (1) is exactly a conditional independence statement. Thus, measuring the violation of equalized odds is equivalent to measuring conditional dependence, and several works have tried to bridge these two problems (Mary et al., 2019; Kamishima et al., 2011; Romano et al., 2020).
In Mary et al. (2019), the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient (HGR) is chosen to measure the conditional dependence for equalized odds and is used as a penalty term to fit a fair model. However, the estimation of HGR, which relies on kernel density estimation, becomes difficult when $A$ is multivariate. Here, we take advantage of recent developments in conditional dependence measures and link them to our problem by introducing a flexible measure proposed by Huang et al. (2022).
Definition 3.1.
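The Kernel Partial Correlation (KPC) coefficient is defined as:
$$\rho^2 \equiv \rho^2(U, V \mid W) := \frac{\mathbb{E}\left[\mathrm{MMD}^2\left(P_{U \mid WV}, P_{U \mid W}\right)\right]}{\mathbb{E}\left[\mathrm{MMD}^2\left(\delta_U, P_{U \mid W}\right)\right]},$$
where $(U, V, W) \sim P$ and $P$ is supported on a subset of some topological space $\mathcal{U} \times \mathcal{V} \times \mathcal{W}$, MMD is the maximum mean discrepancy, a distance metric between two probability distributions that depends on the characteristic kernel $k(\cdot, \cdot)$, and $\delta_U$ denotes the Dirac measure at $U$.
Under mild regularity conditions (see details in Huang et al. (2022)), $\rho^2$ satisfies several good properties for any joint distribution of $(U, V, W)$ in Definition 3.1: (1) $\rho^2 \in [0, 1]$; (2) $\rho^2 = 0$ if and only if $U \perp\!\!\!\perp V \mid W$; (3) $\rho^2 = 1$ if and only if $U$ is a measurable function of $V$ given $W$. A consistent estimator $\hat{\rho}^2$, calculated by geometric graph-based methods (Section 3 in Huang et al. (2022)), is also provided in the R package KPC.
With the aid of KPC, we can rigorously quantify the violation of equalized odds by estimating $\rho^2(\hat{Y}, A \mid Y)$, where $A$ can take arbitrary form and the response $Y$ can be continuous (regression) or categorical (classification).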
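The maximum mean discrepancy that appears in the definition of KPC can be illustrated on its own. The sketch below is not the KPC estimator (which relies on the graph-based construction of Huang et al. (2022)); it is only a biased V-statistic estimate of the squared MMD between two samples with a Gaussian kernel, with hypothetical function names and an arbitrary bandwidth.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of MMD^2 between samples xs and ys."""
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


sample = [(0.0,), (0.5,), (1.0,), (1.5,)]
shifted = [(x[0] + 3.0,) for x in sample]

same = mmd_squared(sample, sample)      # zero up to rounding
apart = mmd_squared(sample, shifted)    # strictly positive
```

KPC compares such discrepancies between conditional distributions, normalized so that the coefficient lies in $[0, 1]$; the toy estimate above only conveys what the numerator and denominator measure.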
3.2 Hypothesis test for equalized odds
We next provide a formal hypothesis test with a statistical guarantee to detect violations of equalized odds. The test again uses permuted versions of $\mathbf{A}$ and amounts to a conditional independence test. The idea is that we repeatedly generate synthetic copies $\tilde{\mathbf{A}}$ by (7); by Theorem 2.1, $(\hat{\mathbf{Y}}, \tilde{\mathbf{A}}, \mathbf{Y})$ has the same distribution as $(\hat{\mathbf{Y}}, \mathbf{A}, \mathbf{Y})$ under the assumption of equalized odds (1). Therefore, any test statistic $T$ yields a valid hypothesis test, since $T(\hat{\mathbf{Y}}, \tilde{\mathbf{A}}, \mathbf{Y})$ also has the same distribution as $T(\hat{\mathbf{Y}}, \mathbf{A}, \mathbf{Y})$ under the assumption of equalized odds. The procedure of our proposed hypothesis test is given in Algorithm 2.
Proposition 3.2.
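Algorithm 2.
Input: Data $(\hat{\mathbf{Y}}, \mathbf{A}, \mathbf{Y})$ from the test set; test statistic $T$.
Parameter: the number of synthetic copies $K$.
Output: A $p$-value for the hypothesis that equalized odds (1) holds.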
We note that a similar hypothesis test for equalized odds is proposed in Romano et al. (2020), which uses a resampled version of $A$ and chooses $T$ in Algorithm 2 as in the Holdout Randomization Test (Tansey et al., 2022): the statistic is based on a fitted predictor and is formulated as an empirical risk (e.g., mean squared error). However, the statistic chosen in Tansey et al. (2022) cannot itself serve as an accurate dependence measure as KPC does.
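The general shape of such a permutation-based test can be sketched as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 2: it uses the standard Monte Carlo $p$-value $(1 + \#\{k : T_k \ge T_{\mathrm{obs}}\}) / (1 + K)$, a toy statistic (absolute covariance) in place of KPC, and a plain shuffle as a stand-in for the ICP sampler, which is appropriate only because $Y$ is constant in the toy data. All function names are hypothetical.

```python
import random


def permutation_p_value(stat, y_hat, a, y, sampler, n_copies=100):
    """Monte Carlo p-value comparing the observed statistic with synthetic copies."""
    observed = stat(y_hat, a, y)
    exceed = sum(
        1 for _ in range(n_copies) if stat(y_hat, sampler(a, y), y) >= observed
    )
    return (1 + exceed) / (1 + n_copies)


def abs_covariance(y_hat, a, y):
    """Toy statistic: absolute empirical covariance between y_hat and a."""
    n = len(a)
    mean_hat = sum(y_hat) / n
    mean_a = sum(a) / n
    return abs(sum((h - mean_hat) * (v - mean_a) for h, v in zip(y_hat, a)) / n)


rng = random.Random(0)


def shuffle_sampler(a, y):
    """Stand-in for the ICP sampler; ignores y, so valid only if A is independent of Y."""
    a_new = list(a)
    rng.shuffle(a_new)
    return a_new


a = [float(i) for i in range(20)]
y = [0.0] * 20          # constant response, so shuffling A is a valid null sampler
y_hat = list(a)          # predictions that depend strongly on the sensitive attribute
p_dependent = permutation_p_value(abs_covariance, y_hat, a, y, shuffle_sampler)
```

In this toy example the prediction is a copy of the sensitive attribute, so the observed statistic exceeds every synthetic one and the $p$-value takes its smallest attainable value, $1/(K+1)$.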
4 Experiments
In this section, we conduct numerical experiments to examine the effectiveness of the proposed approach on both synthetic datasets and real datasets. All the details are included in Appendix D, and the code is available at https://github.com/yuhenglai/FairICP.
4.1 Experiments on synthetic datasets
4.1.1 Synthetic data generation
In this section, we explore the performance of FairICP in simulations with a continuous response $Y$, in which potentially multiple sensitive attributes are involved through two different mechanisms:
- Simulation 1: The response depends on two sets of features and : (Sim1)
- Simulation 2: The response depends on two features and : (Sim2)
is influenced by multiple sensitive attributes in the setting Sim1 and influenced by a sole sensitive attribute in the setting Sim2. The parameter controls the dependence of the predictive feature on , and we consider as a high dependence scenario and as a low dependence scenario in our experiments.
We compare the proposed method FairICP to FDL and to an oracle version of FairICP in which the true conditional density of $Y$ given $A$ is used. These synthetic experiments provide a setting in which we can reliably evaluate the violation of the equalized odds condition by different methods. We are interested in (1) investigating whether FairICP is more effective than FDL as the number of noisy attributes increases (i.e., as the dimension of $A$ grows), by considering the easier problem of estimating the density of $Y \mid A$ rather than $A \mid Y$; and (2) evaluating whether KPC is a good measure of conditional dependence, in the sense that it can capture the relative degree of violation of equalized odds when different methods are applied to the same data sets.
4.1.2 Results on synthetic datasets
We compare FairICP with the conditional density of $Y \mid A$ estimated by MAF (Papamakarios et al., 2017), FDL with the conditional density of $A \mid Y$ estimated by MAF, and the oracle version of FairICP with the true density. To measure the violation of equalized odds, we calculate the empirical KPC using the R package KPC with a Gaussian kernel and default parameters (Huang et al., 2022). Apart from the KPC measure itself, we also consider a second evaluation metric based on the hypothesis test outlined in Algorithm 2 with $T$ chosen as KPC, where we take the power of rejecting the null hypothesis at level $\alpha$ as a measure of conditional dependence when utilizing the underlying true conditional density. A greater KPC or rejection power indicates stronger conditional dependence between $A$ and $\hat{Y}$ given $Y$. Note that in Sim2 only one sensitive attribute is influential, so the test is based on that attribute alone to exclude the effects of noise (though the training is based on the full $A$ for all methods to demonstrate the performance under noise).
Figures 2 and 3 show the trade-off curves between prediction loss and the degree of fairness violation, measured by KPC or by its associated fairness testing power from Algorithm 2 with $T$ chosen as KPC, under settings Sim1 and Sim2 respectively, in the high-dependence scenario (results with low dependence on $A$ are shown in Appendix D.1). We implement $f$ as a linear model and $D$ as a neural network, and all methods being compared are trained with different values of the penalty parameter $\mu$ to show the trade-off. In both simulations, the Pareto fronts are based on 100 independent runs with a sample size of 500 for the training set and 400 for the test set.
Figure 2 shows the results from the setting Sim1. Models from all three methods reduce to a plain linear regression without regard to fairness when $\mu = 0$, resulting in low prediction loss but a severe violation of equalized odds (evidenced by large KPC and statistical power); as $\mu$ increases, the models pay more attention to fairness (lower KPC and power) at the cost of higher prediction loss. FairICP (proposed) performs very close to the oracle model and outperforms FDL as the dimension of $A$ grows, under both the KPC measure and the power measure. This fits our expectation and follows from the increased difficulty of estimating the conditional density of $A \mid Y$. FairICP shows a noticeable, though still smaller, performance reduction relative to the oracle model as measured by KPC when the dimension of $A$ is 10, which is already large compared to what is examined in the current literature. Of note, this slight difference does not show up when measured by the power, likely due to the information lost when dichotomizing the continuous KPC measure into a 0-1 decision at the $p$-value cutoff.
Figure 3 shows the results from setting Sim2 and delivers a message similar to that of Figure 2. The gaps between FairICP and FDL are wider than in Figure 2 as the dimension of $A$ increases, which echoes the smaller fraction of the information in $A$ needed for estimating $Y \mid A$ in setting Sim2.
Note that the power measure depends on how the permutation/sampling is conducted in practice; its reliability hinges on the correctness of the sampling scheme and thus on the accuracy of density estimation. In contrast, the direct KPC (Kernel Partial Correlation) measure is independent of density estimation. We can trust the power evaluation in our synthetic experiments because we have utilized the true conditional density. The consistency between the KPC measures and the power measures in our synthetic experiments suggests that KPC is a reasonable and density-estimation-free measure for comparing different learning methods in real applications.
4.2 Real-data experiments
We consider real-world cases where we may need to protect more than one sensitive attribute. For all the experiments, we split the data into a training set (60%) and a test set (40%), and all the results shown are based on the test set.
4.2.1 Fair regression
In the Communities and Crime dataset (available at the UC Irvine Data Repository, http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), each record describes the aggregate demographic properties of a different U.S. community; the data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The total number of records is 1994, and the number of features is 122. Our task here is to predict the number of violent crimes per population for US cities while protecting all race information, to avoid biasing police checks depending on the ethnic characteristics of the community. Specifically, we take three minority-race variables in this dataset into account (African American, Hispanic, Asian) as sensitive attributes, instead of only one as done in the previous literature. We also consider the case where $A$ includes only one race (African American), as in Romano et al. (2020); Mary et al. (2019), for better comparison. All the sensitive attributes used here are continuous, representing the percentage of the population of certain races.
We compare our proposed method with FDL (Romano et al., 2020) and HGR (Mary et al., 2019). Since the implementation of Mary et al. (2019) does not directly apply to multiple sensitive attributes, we use the mean of the three per-attribute HGR coefficients as the penalty. Note that we do not include sensitive attributes as features in our experiments, as in Romano et al. (2020); Mary et al. (2019). We use neural networks as the predictor $f$ in all methods (we also consider $f$ as a linear model in Appendix D.2), and we tune the hyperparameters as in Romano et al. (2020) (see details in Appendix D).
We present our results as Pareto fronts in Figure 4 to show the prediction-fairness trade-off curves given by our method and the state-of-the-art methods, where fairness is measured both by KPC and by the power of the statistical test for fairness outlined in Algorithm 2 with $T$ chosen as KPC. Both metrics give similar trends: although there are some small discrepancies between KPC and the fairness test, FairICP outperforms FDL and HGR, especially when all three sensitive attributes are considered. Although the conditional density is now estimated and the fairness test might suffer from estimation error, KPC is a robust measure regardless of the sampling scheme for $\tilde{A}$.
4.2.2 Fair classification
We then turn to a well-studied binary classification case with two categorical sensitive attributes. The dataset we consider is ProPublica's COMPAS recidivism data (5278 examples). Although this dataset is widely used in the fairness literature, its limitations have recently been critiqued (Bao et al., 2022). The task is to predict recidivism from someone's criminal history, jail and prison time, demographics, and COMPAS risk scores. We choose two binary protected attributes $A$: race (white vs. non-white) and sex. For this special task (binary classification against multiple binary sensitive attributes), we compare FairICP to two baselines, HGR (Mary et al., 2019) and exponentiated-gradient reduction (Agarwal et al., 2018), with the latter developed for this particular kind of task. We use this example to demonstrate the ability of FairICP to handle categorical observations and to provide performance comparable to that of the more tailored approach.
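In addition to KPC and the corresponding fairness test, we also consider another fairness metric, based on the confusion matrix (Hardt et al., 2016; Cho et al., 2020) and designed for binary classification with categorical sensitive attributes, to quantify equalized odds:
$$\mathrm{DEO} := \sum_{y \in \{0, 1\}} \sum_{a} \left| \Pr(\hat{Y} = 1 \mid A = a, Y = y) - \Pr(\hat{Y} = 1 \mid Y = y) \right|, \tag{8}$$
where $\hat{Y}$ here denotes the predicted class label and $a$ ranges over the values of $A$.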
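A confusion-matrix-based metric of this kind is straightforward to compute from predicted labels. The sketch below assumes the definition in the style of Cho et al. (2020), i.e., the sum over outcome classes and sensitive groups of the absolute gap between the group-specific and the overall positive-prediction rate; the function name `deo` and the toy data are hypothetical.

```python
def deo(y_pred, a, y_true):
    """Sum over y and groups of |P(pred = 1 | A = a, Y = y) - P(pred = 1 | Y = y)|."""
    total = 0.0
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        base_rate = sum(y_pred[i] for i in idx) / len(idx)
        # Each distinct combination of sensitive attributes is treated as one group.
        for group in sorted(set(a[i] for i in idx)):
            members = [i for i in idx if a[i] == group]
            group_rate = sum(y_pred[i] for i in members) / len(members)
            total += abs(group_rate - base_rate)
    return total


# Sensitive attributes encoded as (race, sex) pairs; two groups in this toy example.
g0, g1 = (0, 0), (1, 0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
a = [g0, g0, g1, g1, g0, g0, g1, g1]

unfair_pred = [1, 1, 0, 0, 0, 0, 0, 0]   # true positives found only in group g0
fair_pred = list(y_true)                  # perfect predictions satisfy equalized odds

unfair_score = deo(unfair_pred, a, y_true)
fair_score = deo(fair_pred, a, y_true)
```

In the unfair example the positive-prediction rate among true positives is 1 in one group and 0 in the other against an overall rate of 0.5, so each group contributes 0.5 to the metric.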
Similar to the regression case, we train neural network models as classifiers and discriminators (see details in Appendix D); we also consider $f$ as a linear model in Appendix D.2.
Figure 5 shows that all three methods behave similarly overall in this classification example regarding their prediction-fairness trade-offs, with FairICP closely matching the performance of the exponentiated-gradient reduction (referred to as Reduction) under all three fairness evaluation metrics, and HGR slightly worse than FairICP and Reduction when measured by DEO.
5 Discussion
We introduced a flexible fairness learning approach, FairICP, to address the challenge of achieving equalized-odds fairness with complex sensitive attributes. FairICP combines adversarial learning with a novel inverse conditional permutation (ICP) strategy and offers a flexible and effective solution for handling sensitive attributes that may be multidimensional and of mixed data types. We provided theoretical insights into the proposed method, elucidating the underpinning concepts and the rationale behind integrating ICP with adversarial learning. Furthermore, we conducted numerical experiments on both synthetic and real data to support our theoretical insights and demonstrate the efficacy and flexibility of our proposed method. In our experience, the majority of the computational burden of FairICP lies in training the adversarial prediction model (as also mentioned in Zhang et al. (2018); Romano et al. (2020)), with the cost of density estimation and ICP sampling being negligible in comparison. The scalability challenge of adversarial techniques should be addressed more carefully by implementing more efficient methods, which we view as a future direction for improving FairICP.
References
- Agarwal et al. (2018) Agarwal, A., A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach (2018). A reductions approach to fair classification. In International conference on machine learning, pp. 60–69. PMLR.
- Bao et al. (2022) Bao, M., A. Zhou, S. Zottola, B. Brubach, S. Desmarais, A. Horowitz, K. Lum, and S. Venkatasubramanian (2022). It’s COMPASlicated: The messy relationship between RAI datasets and algorithmic fairness benchmarks.
- Berrett et al. (2020) Berrett, T. B., Y. Wang, R. F. Barber, and R. J. Samworth (2020). The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society Series B: Statistical Methodology 82(1), 175–197.
- Candès et al. (2018) Candès, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B 80(3), 551–577.
- Castelnovo et al. (2022) Castelnovo, A., R. Crupi, G. Greco, D. Regoli, I. G. Penco, and A. C. Cosentini (2022). A clarification of the nuances in the fairness metrics landscape. Scientific Reports 12(1), 4209.
- Cho et al. (2020) Cho, J., G. Hwang, and C. Suh (2020). A fair classifier using kernel density estimation. Advances in neural information processing systems 33, 15088–15099.
- Creager et al. (2019) Creager, E., D. Madras, J.-H. Jacobsen, M. Weis, K. Swersky, T. Pitassi, and R. Zemel (2019, 09–15 Jun). Flexibly fair representation learning by disentanglement. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Volume 97 of Proceedings of Machine Learning Research, pp. 1436–1445. PMLR.
- Dwork et al. (2012) Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
- Feldman et al. (2015) Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
- Goodfellow et al. (2014) Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
- Hardt et al. (2016) Hardt, M., E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning. Advances in neural information processing systems 29.
- Hebert-Johnson et al. (2018) Hebert-Johnson, U., M. Kim, O. Reingold, and G. Rothblum (2018, 10–15 Jul). Multicalibration: Calibration for the (Computationally-identifiable) masses. In J. Dy and A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research, pp. 1939–1948. PMLR.
- Huang et al. (2022) Huang, Z., N. Deb, and B. Sen (2022). Kernel partial correlation coefficient — a measure of conditional dependence. Journal of Machine Learning Research 23(216), 1–58.
- Kamishima et al. (2011) Kamishima, T., S. Akaho, and J. Sakuma (2011). Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650. IEEE.
- Kearns et al. (2018) Kearns, M., S. Neel, A. Roth, and Z. S. Wu (2018, 10–15 Jul). Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy and A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research, pp. 2564–2572. PMLR.
- Keener (2010) Keener, R. W. (2010). Theoretical statistics: Topics for a core course. Springer.
- Kim et al. (2018) Kim, M. P., A. Ghorbani, and J. Zou (2018). Multiaccuracy: Black-box post-processing for fairness in classification.
- Kusner et al. (2017) Kusner, M. J., J. Loftus, C. Russell, and R. Silva (2017). Counterfactual fairness. In Advances in Neural Information Processing Systems 30, pp. 4066–4076.
- Mary et al. (2019) Mary, J., C. Calauzenes, and N. El Karoui (2019). Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391.
- Mehrabi et al. (2021) Mehrabi, N., F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2021). A survey on bias and fairness in machine learning. ACM computing surveys (CSUR) 54(6), 1–35.
- Naaman (2021) Naaman, M. (2021). On the tight constant in the multivariate Dvoretzky–Kiefer–Wolfowitz inequality. Statistics & Probability Letters 173, 109088.
- Papamakarios et al. (2017) Papamakarios, G., T. Pavlakou, and I. Murray (2017). Masked autoregressive flow for density estimation. Advances in neural information processing systems 30.
- Romano et al. (2020) Romano, Y., S. Bates, and E. Candes (2020). Achieving equalized odds by resampling sensitive attributes. Advances in neural information processing systems 33, 361–371.
- Scott (1991) Scott, D. W. (1991). Feasibility of multivariate density estimates. Biometrika 78(1), 197–205.
- Tang and Zhang (2022) Tang, Z. and K. Zhang (2022). Attainability and optimality: The equalized odds fairness revisited. In Conference on Causal Learning and Reasoning, pp. 754–786. PMLR.
- Tansey et al. (2022) Tansey, W., V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei (2022). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics 31(1), 151–162.
- Yang et al. (2022) Yang, J., A. A. S. Soltan, Y. Yang, and D. A. Clifton (2022). Algorithmic fairness and bias mitigation for clinical machine learning: Insights from rapid covid-19 diagnosis by adversarial learning. medRxiv.
- Zafar et al. (2017) Zafar, M. B., I. Valera, M. Gomez Rodriguez, and K. P. Gummadi (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180.
- Zafar et al. (2019) Zafar, M. B., I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research 20(75), 1–42.
- Zemel et al. (2013) Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013). Learning fair representations. In International conference on machine learning, pp. 325–333. PMLR.
- Zhang et al. (2018) Zhang, B. H., B. Lemoine, and M. Mitchell (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340.
Appendix A Proofs
Proof of Theorem 2.1.
Let denote the row set of the observed realizations of sensitive attributes (unordered and duplicates are allowed). Let , and be the associated feature, prediction, and response observations.
1. Task 1: Show that given conditional independence .
Proof of Task 1. Recall that conditional on for some , we have (Berrett et al., 2020):
(9) where is the stacked values in . On the other hand, conditional on , by construction:
(10) where the last equality utilizes the following fact,
Consequently, under the conditional independence assumption, we can write the joint distribution of as the following (, are some stacked observation values and for and respectively):
(11) Here, step has used eq. (9) and eq. (10), which establish the equivalence between the conditional law of and ; steps and rely on the conditional independence relationships and . Hence, conditional independence implies the distributional equivalence .
2. Task 2: Show the further conditioned conditional independence given .
3. Task 3: Show the asymptotic equalized odds given .
Proof of Task 3. We prove this statement using the previous statement and a known bound for multi-dimensional c.d.f (cumulative distribution function) estimation (Naaman, 2021). Let and be constant vectors of the same dimensions as and , and be a constant vector of the same dimension as . Construct augmented matrix , , where , and , for , and the same for all . Let be a from the same distribution as . Then,
where step has used the fact that , for are independently generated, thus, conditioning on additional independent does not change the probability; step holds because and , for , take infinite values and do not modify the event considered. Utilizing eq. (12), we further have
where step has used again the fact that and , for , and is defined as
Our goal is equivalent to bounding . Notice that since are the same for all samples, , , are exchangeable given . Consequently, we obtain that
where step has used the equivalence of , which leads to the -induced empirical c.d.f evaluated at . Also, is a set of samples generated conditional on , and denotes the empirical c.d.f induced by and denotes the theoretical c.d.f of . From Lemma 4.1 in Naaman (2021), which generalizes the Dvoretzky–Kiefer–Wolfowitz inequality to the multi-dimensional empirical c.d.f, we know
Combining this with the bound for , we have
for a sufficiently large as . We thus reach the conclusion that
∎
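As a numerical illustration of the kind of concentration bound invoked in Task 3, the sketch below checks the classical one-dimensional Dvoretzky–Kiefer–Wolfowitz inequality with Massart's constant, P(sup_t |F_n(t) − F(t)| > ε) ≤ 2 exp(−2nε²), on Uniform(0, 1) samples. This is not the multivariate version of Naaman (2021) used in the proof; the function names, sample size, and ε are assumptions chosen only for illustration.

```python
import numpy as np

def sup_deviation_uniform(sample):
    """sup_t |F_n(t) - t| for a sample compared with the Uniform(0, 1) c.d.f."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    upper = np.arange(1, n + 1) / n - x   # F_n right after each jump, minus F
    lower = x - np.arange(0, n) / n       # F minus F_n right before each jump
    return float(max(upper.max(), lower.max()))

def dkw_bound(n, eps):
    """Classical one-dimensional DKW bound with Massart's constant."""
    return 2.0 * np.exp(-2.0 * n * eps ** 2)

# Illustrative settings (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, eps, n_rep = 200, 0.1, 2000
devs = np.array([sup_deviation_uniform(rng.random(n)) for _ in range(n_rep)])
print("empirical exceedance frequency:", np.mean(devs > eps))
print("DKW upper bound:", dkw_bound(n, eps))
```

The empirical exceedance frequency is a Monte Carlo estimate, so it fluctuates around the true probability; the bound itself decays exponentially in n for any fixed ε, which is the property the proof relies on.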
Proof of Theorem 2.3.
For fixed , the optimal discriminator is reached at
in which case, the discriminating classifier is (see Proposition 1 in Goodfellow et al. (2014)), and reduces to
where is the Jensen–Shannon divergence between the distributions of and . Plugging this into , we reach the single-parameter form of the original objective:
where the equality holds at . In summary, the solution value is achieved when:
- minimizes the negative log-likelihood of under , which happens when are the solutions of an optimal predictor . In this case, reduces to its minimum value
- minimizes the Jensen–Shannon divergence , since the Jensen–Shannon divergence between two distributions is always non-negative, and zero if and only if they are equal.
The second characterization is equivalent to the condition . Note that this is a population level characterization with corresponding to the case where . As a result, by the asymptotic equalized odds statement in Theorem 2.1, we have that also satisfies equalized odds. ∎
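The role of the Jensen–Shannon divergence in this argument can be checked numerically on discrete distributions. The sketch below uses two generic, strictly positive probability vectors `p` and `q` (placeholders, not the paper's notation) and verifies the identity from Goodfellow et al. (2014): at the optimal discriminator D* = p / (p + q), the value E_p[log D*] + E_q[log(1 − D*)] equals −log 4 + 2·JSD(p, q), which attains its minimum −log 4 exactly when p = q.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_discriminator(p, q):
    """E_p[log D*] + E_q[log(1 - D*)] with D* = p / (p + q).

    Assumes p and q are strictly positive so that every logarithm is finite.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    d_star = p / (p + q)
    return float(np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star)))

# Hypothetical example distributions.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(value_at_optimal_discriminator(p, q), -np.log(4.0) + 2.0 * jsd(p, q))
```

When the two arguments coincide, D* is identically 1/2, so the discriminator cannot distinguish the two distributions; this is the situation the second characterization describes.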
Appendix B Sampling Algorithm
To sample the permutation from the probabilities:
we use the parallelized pairwise sampler for the CPT proposed in Berrett et al. (2020), which is detailed as follows:
Input: Data , Initial permutation , integer .
Output: Permuted copy .
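A minimal sketch of a parallelized pairwise sampler in the spirit of Berrett et al. (2020) is given below. It assumes the estimated conditional log-densities have been precomputed into a matrix `log_dens`, where `log_dens[k, i]` is the log-density of the k-th observation of the variable being permuted given the i-th conditioning value. The function name, the matrix layout, the identity initial permutation, and the default number of steps are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def parallel_pairwise_sampler(log_dens, n_steps=50, seed=None):
    """Sample a permutation with P(perm) proportional to
    prod_i exp(log_dens[perm[i], i]) via repeated random pairwise swaps.

    log_dens[k, i]: (hypothetical layout) log of the estimated conditional
    density of the k-th permuted-side observation given the i-th
    conditioning value.
    """
    rng = np.random.default_rng(seed)
    n = log_dens.shape[0]
    perm = np.arange(n)                    # initial permutation: identity
    half = n // 2
    for _ in range(n_steps):
        order = rng.permutation(n)         # draw disjoint pairs (i, j)
        i, j = order[:half], order[half:2 * half]
        # Log-odds of swapping the entries currently held at positions i and j.
        log_odds = (log_dens[perm[i], j] + log_dens[perm[j], i]
                    - log_dens[perm[i], i] - log_dens[perm[j], j])
        # Swap probability = odds / (1 + odds), computed stably in log space.
        log_p_swap = log_odds - np.logaddexp(0.0, log_odds)
        swap = rng.random(half) < np.exp(log_p_swap)
        perm[i[swap]], perm[j[swap]] = perm[j[swap]], perm[i[swap]]
    return perm

# Usage with a flat (uninformative) density: every swap has probability 1/2.
perm = parallel_pairwise_sampler(np.zeros((6, 6)), n_steps=20, seed=1)
print(perm)
```

Because the pairs within one step are disjoint, all swap decisions in that step can be made simultaneously, which is what makes the sampler parallel. Each swap decision depends only on ratios of densities, so the normalizing constant of the permutation distribution is never needed.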
Appendix C Additional comparisons of CP/ICP
When we know the true conditional laws (conditional density given ) and (conditional density given ), both CP and ICP provide accurate conditional permutation copies. However, both densities are estimated in practice, and the estimated densities are denoted as and respectively. The density estimation quality will depend on both the density estimation algorithm and the data distribution. While a deep dive into this aspect, especially from the theoretical side, is beyond the scope of this paper, we provide some additional heuristic insights into the potential gain of ICP over CP.
When might ICP improve over CP?
Following the proof of Theorem 4 in Berrett et al. (2020), let be some permuted copies of according to the estimated conditional law . An upper bound on the exchangeability violation for and is related to the total variation between the estimated density and :
(13) |
where step follows from Lemma B.8 of Ghosal and van der Vaart (2017). We adapt the proof arguments of Theorem 4 in Berrett et al. (2020) to the ICP procedure.
Specifically, let be the conditional permutation of according to and be a new copy sampled according to . We will have
(14) |
There is one issue before we can compare the CP and ICP upper bounds on exchangeability violations: the two bounds consider different variables and conditioning events. Since we care only about distribution-level comparisons, we can apply permutation to and . The resulting is equivalent to and the resulting is exactly the ICP conditionally permuted version. Next, we can remove the conditioning event by marginalizing out and in (13) and (14), respectively. Hence, we obtain upper bounds on the violation of exchangeability using CP and ICP permutation copies, which is smaller for ICP if is more accurate on average:
ICP achieved higher quality empirically
To illustrate that ICP can provide a resampling distribution closer to that of the oracle conditional permutation than CP does, with both relying on off-the-shelf density estimation tools across varying dimensions, we consider the following examples:
(1) Let . Here, are independently generated from either the standard normal or a mixed Gamma distribution ; is a randomly generated covariance matrix with eigenvalues equally spaced on .
(2) Let . That is, is influenced only by the first columns of , with the next columns of being noise.
(3) We estimate / using (1) lasso regression/graphical lasso, where we estimate the linear dependence of on and the variance empirically for , and estimate assuming joint normality of (for and , OLS was used for both estimations); and (2) MAF, which is the default in our paper.
We set , , and the sample size for density estimation and for evaluating the conditional permutation distribution to both be 200. We are interested in the total variation distance between the permutation distributions from ICP and CP under the estimated densities and that under the true density, which is explicitly known in this example up to a normalization constant.
Due to the large permutation space, calculating the actual total variation distance is difficult. To circumvent this challenge, we restrict the permutation space to swapping actions: we consider the TV distance ( transformed) restricted to permutations that swap and for , together with the original order, and compare ICP and CP to the oracle conditional permutations on such permutations only.
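The swap-restricted comparison can be sketched as follows. The code assumes a hypothetical matrix `log_dens`, with `log_dens[k, i]` the log conditional density of the k-th permuted-side observation given the i-th conditioning value, and computes the distribution over the original order plus all single swaps, followed by a plain (untransformed) total variation distance. The function names and matrix layout are illustrative assumptions, and any transformation applied to the TV distance in the reported results is not reproduced here.

```python
import numpy as np
from itertools import combinations

def swap_restricted_probs(log_dens):
    """Distribution over {original order} plus all single swaps (i, j), i < j.

    Weights are proportional to prod_k exp(log_dens[perm[k], k]) and are
    computed relative to the original order, so additive constants per row
    or per column of log_dens (e.g. unknown normalizers) cancel.
    """
    n = log_dens.shape[0]
    diag = np.diag(log_dens)
    log_w = [0.0]                                  # original order (reference)
    for i, j in combinations(range(n), 2):
        log_w.append(log_dens[j, i] + log_dens[i, j] - diag[i] - diag[j])
    log_w = np.array(log_w)
    log_w -= log_w.max()                           # numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

# Hypothetical log-densities standing in for the true and an estimated model.
rng = np.random.default_rng(0)
log_true = rng.normal(size=(5, 5))
log_est = log_true + 0.1 * rng.normal(size=(5, 5))
print(tv_distance(swap_restricted_probs(log_est), swap_restricted_probs(log_true)))
```

Restricting to single swaps reduces the support from n! permutations to 1 + n(n − 1)/2, which makes the distance computable exactly while still reflecting how well the estimated density ranks local rearrangements.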
Figure 6 and Figure 7 show results using (1) MAF and (2) cross-validated lasso regression or graphical lasso, respectively (repeated 20 times for each setting). We see that the TV distances between ICP and the oracle are smaller than the corresponding ones for CP using both density estimation approaches. MAF is a default density estimation approach for general purposes. By design, lasso regression/OLS is favored over MAF for estimating in this particular example. There may be better density estimation choices in other applications, but overall, estimating can be simpler and allows us to utilize existing tools, e.g., those designed for supervised learning.
Appendix D Experiments
In both simulation studies and real-data experiments, we implement the algorithms with the hyperparameters chosen by the tuning procedure of Romano et al. (2020). In practice, we tune the hyperparameters only once using 10-fold cross-validation on the entire data set and then treat the chosen set as fixed for the rest of the experiments. We then compare the performance metrics of the different algorithms on 100 data splits that are different from the ones used to tune the parameters. The same tuning scheme is used for all methods, ensuring that the comparisons are meaningful.
D.1 Experiments on synthetic datasets
For all the models evaluated (FairICP, Oracle, FDL), we set the hyperparameters as follows:
- We set as a linear model and use the Adam optimizer with a mini-batch size in {16, 32, 64}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {20, 40, 60, 80, 100, 120, 140, 160, 180, 200}. The discriminator is implemented as a four-layer neural network with a hidden layer of size 64 and ReLU non-linearities. We use the Adam optimizer, with a fixed learning rate of 1e-4.
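For concreteness, the search space listed above for the synthetic experiments can be enumerated as in the sketch below. The dictionary keys and the helper name `enumerate_configs` are hypothetical; only the candidate values come from the text.

```python
from itertools import product

# Candidate values as listed for the synthetic experiments.
grid = {
    "batch_size": [16, 32, 64],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_epochs": list(range(20, 201, 20)),   # 20, 40, ..., 200
}

def enumerate_configs(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

configs = enumerate_configs(grid)
print(len(configs))   # 3 * 3 * 10 = 90 candidate configurations
```

Each configuration would then be scored by the cross-validation procedure described at the start of this appendix.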
D.1.1 Low sensitive attribute dependence for Sim1
We report the results with A-dependence here:
D.1.2 Low sensitive attribute dependence for Sim2
We report the results with A-dependence here:
D.2 Real-data experiments
D.2.1 Regression
For FairICP and FDL, the hyperparameters used for the linear model and neural network are as follows:
- Linear: we set as a linear model and use the Adam optimizer with a mini-batch size in {16, 32, 64}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {20, 40, 60, 80, 100}. The discriminator is implemented as a four-layer neural network with a hidden layer of size 64 and ReLU non-linearities. We use the Adam optimizer, with a fixed learning rate of 1e-4. The penalty parameter is set as .
- Neural network: we set as a two-layer neural network with a 64-dimensional hidden layer and ReLU activation function. The hyperparameters are the same as for the linear model.
For HGR, the hyperparameters used for the linear model and neural network are as follows:
- Linear: we set as a linear model and use the Adam optimizer with a mini-batch size in {16, 32, 64}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {20, 40, 60, 80, 100}. The penalty parameter is set as .
- Neural network: we set as a two-layer neural network with a 64-dimensional hidden layer and ReLU activation function. The hyperparameters are the same as the linear ones.
We report the results with as a linear model here, which are similar to those of the NN version:
D.2.2 Classification
For FairICP, the hyperparameters used for the linear model and neural network are as follows:
- Linear: we set as a linear model and use the Adam optimizer with a mini-batch size in {64, 128, 256}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {50, 100, 150, 200, 250, 300}. The discriminator is implemented as a four-layer neural network with a hidden layer of size 64 and ReLU non-linearities. We use the Adam optimizer, with a fixed learning rate in {1e-4, 1e-3}. The penalty parameter is set as .
- Neural network: we set as a two-layer neural network with a 64-dimensional hidden layer and ReLU activation function. The hyperparameters are the same as for the linear model.
For HGR, the hyperparameters used for the linear model and neural network are as follows:
- Linear: we set as a linear model and use the Adam optimizer with a mini-batch size in {64, 128, 256}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {20, 40, 60, 80, 100}. The penalty parameter is set as .
- Neural network: we set as a two-layer neural network with a 64-dimensional hidden layer and ReLU activation function. The hyperparameters are the same as the linear ones.
We report the results with as a linear model here: