
Flexible Fairness Learning via Inverse Conditional Permutation

Yuheng Lai (Dept. of Mathematics, Renmin University of China), Leying Guan (Dept. of Biostatistics, Yale University).
Correspondence to: [email protected]
This work is supported by NSF DMS-2310836.

Abstract

Equalized odds, as a popular notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm prediction when conditioning on the true outcome. Despite rapid advancements, most of the current research focuses on the violation of equalized odds caused by one sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes under-addressed. We address this gap by introducing a fairness learning approach that integrates adversarial learning with a novel inverse conditional permutation. This approach effectively and flexibly handles multiple sensitive attributes, potentially of mixed data types. The efficacy and flexibility of our method are demonstrated through both simulation studies and empirical analysis of real-world datasets.

1 Introduction

Machine learning models have become important tools for aiding decision-making in various applications. One of the challenges in applying machine learning is ensuring that the models are fair, i.e., they do not discriminate against minorities or other protected groups (Mehrabi et al., 2021). Several fairness concepts have been developed in the literature to address different practical needs (Mehrabi et al., 2021; Castelnovo et al., 2022). In this work, we consider the equalized odds criterion (Hardt et al., 2016), defined as

\hat{Y} \perp\!\!\!\perp A \mid Y.

(1)

Here, $Y$ is the response variable, $A$ denotes the sensitive attribute(s) that we wish to protect (e.g., gender, race, or income), and $\hat{Y}$ is the prediction given by any model. Notice that when dropping the conditioning on $Y$, (1) becomes the unconditional independence $\hat{Y} \perp\!\!\!\perp A$, which corresponds to demographic parity. The requirement that a learned model satisfy certain independence relations is not limited to the realm of fairness: in this broader context, statisticians have long worked on robust inference techniques based on the concept of a pivot, a quantity whose distribution is invariant with respect to the nuisance parameters (see, e.g., Keener (2010)).
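To make criterion (1) concrete, the following minimal sketch checks it in the simplest special case: a binary label, a binary hard prediction, and one categorical sensitive attribute. In that case (1) says the true positive rate and the false positive rate agree across groups. The function name `equalized_odds_gap` and the max-gap summary are illustrative choices introduced here, not part of the paper, which targets more general (continuous, multi-dimensional) settings.

```python
import numpy as np

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in P(y_pred = 1 | Y = y, A = g), over y in {0, 1}.

    Illustrative helper (not from the paper). Assumes binary labels/predictions
    and that every (label, group) cell contains at least one sample.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    worst = 0.0
    for y in (0, 1):
        # Positive-prediction rate within each group among samples with label y:
        # y = 1 gives the true positive rate, y = 0 the false positive rate.
        rates = [y_pred[(y_true == y) & (group == g)].mean() for g in np.unique(group)]
        worst = max(worst, max(rates) - min(rates))
    return worst
```

A gap of 0 is consistent with equalized odds in this special case; a predictor that simply copies the group label attains the maximal gap of 1.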

Despite the exciting progress, most existing algorithms aiming for equalized odds can only handle one protected attribute. However, in fields including clinical research, there is a growing need to mitigate biases related to multiple sensitive attributes (Yang et al., 2022). It has also been pointed out that fairness gerrymandering can occur when algorithmic decision-making considers only a single sensitive attribute at a time (Kearns et al., 2018). Additionally, the equalized-odds problem in the context of continuous sensitive attributes is much less explored.

Here, we alleviate these two limitations by proposing a versatile equalized odds training scheme, FairICP, illustrated in Figure 1. Building on the sensitive attribute resampling framework (Romano et al., 2020), we generate $\tilde{A}$, a conditional permutation of $A$ given $Y$, using a novel inverse conditional permutation (ICP) strategy, and construct a fairer model by regularizing the distribution of $(\hat{Y},A,Y)$ toward the distribution of $(\hat{Y},\tilde{A},Y)$. Our contributions are summarized below.

• We propose a novel inverse conditional permutation (ICP) strategy to generate $\tilde{A}$, conditional permutations of $A$, without estimating the multi-dimensional conditional density of $A|Y$.

• We show theoretically that the equalized odds condition holds asymptotically for $(\hat{Y},\tilde{A},Y)$ when $\tilde{A}$ is generated according to ICP.

• We propose examining the fairness level with a recently developed non-parametric conditional dependence measure.

• We demonstrate experimentally that FairICP enjoys improved efficacy and flexibility.

Figure 1: Illustration of the FairICP framework. $A$, $X$, and $Y$ denote the sensitive attributes, other features, and labels, respectively.

Related work

Existing fairness concepts can be divided into different categories, including statistical/group fairness (Hardt et al., 2016; Zafar et al., 2017), which aims to ensure similar predictions across different groups; individual fairness (Dwork et al., 2012), which targets similar predictions for similar individuals; and causality-based fairness (Kusner et al., 2017), which tries to reveal causal relationships. More comprehensive discussions can be found in (Mehrabi et al., 2021; Castelnovo et al., 2022). Prominent statistical fairness measures include demographic parity (Zafar et al., 2019), equal opportunity (Hardt et al., 2016), and equalized odds (Hardt et al., 2016), which can all be articulated as (conditional) independence relations from a statistical perspective. Given the fairness concept, the associated procedures can be generally categorized into three types: (1) pre-processing, (2) post-processing, and (3) in-processing. Pre-processing aims to correct potentially biased data before any model fitting procedures (Zemel et al., 2013; Feldman et al., 2015), while post-processing modifies the classifier’s output at the test phase, leaving the model unchanged (Hardt et al., 2016; Kim et al., 2018; Hebert-Johnson et al., 2018).

FairICP is an in-processing method that encourages equalized-odds fairness for multiple complex sensitive attributes during model training. Several in-processing methods have been previously introduced to address the violation of equalized odds. For example, Agarwal et al. (2018) describe a procedure for handling categorical sensitive attributes in binary classification. Mary et al. (2019) train a model that penalizes the violation of equalized odds, measured by the Hirschfeld-Gebelein-Rényi (HGR) Maximum Correlation Coefficient; the method is designed to reduce equalized-odds violations in the presence of one sensitive attribute, whether categorical or continuous. Closely related to FairICP, another line of in-processing algorithms encourages fairness using an adversarial loss designed for different fairness metrics (Zhang et al., 2018). In particular, Romano et al. (2020) propose a novel adversarial learning loss that utilizes a synthetic variable $\tilde{A}$ resampled from the conditional distribution of a potentially continuous $A$ given $Y$. Although the joint consideration of multiple sensitive attributes has been explored for demographic parity under this framework (Creager et al., 2019), jointly modeling multiple sensitive attributes, especially continuous ones, remains an unresolved challenge. This challenge is largely due to the difficulty of estimating the conditional density of $A|Y$. Our approach shares a similar loss design with that of Romano et al. (2020) but employs a novel permutation technique capable of handling multiple and complex protected variables.

2 Method

We propose a general adversarial learning procedure to obtain models with an improved equalized odds guarantee by utilizing a novel Inverse Conditional Permutation (ICP). The proposed procedure, FairICP, enables efficient fairness learning with multi-dimensional sensitive attributes and with either a categorical or a continuous response $Y$. Before describing our proposal, we first define the notation used throughout this paper. We also review the framework of model training with an equalized odds penalty based on sensitive attribute re-sampling (Romano et al., 2020), and the challenge in applying sensitive attribute re-sampling and existing methods to multi-dimensional attributes, which motivates our proposal.

Let $\left(X_{i},A_{i},Y_{i}\right)$ for $i=1,\ldots,n_{\mathrm{tr}}$ be i.i.d. generated triples of (feature, sensitive attribute, response). Let $f_{\theta_{f}}(.)$ be a prediction function with model parameter $\theta_{f}$. Although $f_{\theta_{f}}(.)$ can be any prediction function that is differentiable in $\theta_{f}$, we will take $f_{\theta_{f}}(.)$ to be a neural network throughout this work. Let $\hat{Y}=f_{\theta_{f}}(X)$ be the prediction for $Y$ given $X$. For a regression problem, $\hat{Y}$ is the predicted value of the continuous response $Y$; for a classification problem, the last layer of $f_{\theta_{f}}(.)$ is a softmax layer and $\hat{Y}$ is the predicted probability vector for being in each class. We also denote ${\mathbf{X}}=\left(X_{1},\ldots,X_{n_{\mathrm{tr}}}\right)$, ${\mathbf{A}}=\left(A_{1},\ldots,A_{n_{\mathrm{tr}}}\right)$, ${\mathbf{Y}}=\left(Y_{1},\ldots,Y_{n_{\mathrm{tr}}}\right)$ and $\hat{\mathbf{Y}}=(\hat{Y}_{1},\ldots,\hat{Y}_{n_{\mathrm{tr}}})$.

2.1 Fairness-learning via sensitive attribute re-sampling

We first present the Fair Dummies Learning (FDL) framework (Romano et al., 2020) for achieving equalized odds with one sensitive attribute. Our terminology differs somewhat from that used in this reference, to help us introduce the new perspectives and frameworks later in this paper.

To evaluate the potential violation of equalized odds (1) in the prediction $\hat{\mathbf{Y}}$, FDL constructs a resampled version $\tilde{\mathbf{A}}$ of the original sensitive attribute to serve as a contrast, sampled according to $\tilde{\mathbf{A}}\sim Q^{n_{\mathrm{tr}}}(\cdot\mid\mathbf{Y})$, where $Q^{n_{\mathrm{tr}}}(\cdot\mid\mathbf{Y}):=\prod_{1\leq i\leq n_{\mathrm{tr}}}Q\left(\cdot\mid Y_{i}\right)$ and $Q(\cdot\mid y)$ denotes the conditional distribution of $A$ given $Y=y$. Since we generate $\tilde{\mathbf{A}}$ without looking at $\hat{\mathbf{Y}}$, the following equalized odds property holds: $\hat{Y}_{i} \perp\!\!\!\perp \tilde{A}_{i}\mid Y_{i}$. Hence, we can measure the degree of violation of the equalized odds condition by measuring the discrepancy between the distribution of $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})$ and the distribution of $(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$. Following this intuition, FDL utilizes a GAN (Goodfellow et al., 2014) to iteratively learn how to separate the two distributions and optimize a fairness-regularized prediction loss. More specifically, define

	$\displaystyle\mathcal{L}_{f}\left(\theta_{f}\right)=\mathbb{E}_{XY}\left[-\log p_{\theta_{f}}(Y\mid X)\right]$		(2)
	$\displaystyle\mathcal{L}_{d}\left(\theta_{f},\theta_{d}\right)=\mathbb{E}_{\hat{Y}AY}\left[-\log D_{\theta_{d}}(\hat{Y},A,Y)\right]+\mathbb{E}_{\hat{Y}\tilde{A}Y}\left[-\log\left(1-D_{\theta_{d}}(\hat{Y},\tilde{A},Y)\right)\right]$		(3)
	$\displaystyle V_{\mu}(\theta_{f},\theta_{d})=(1-\mu)\mathcal{L}_{f}(\theta_{f})-\mu\mathcal{L}_{d}(\theta_{f},\theta_{d})$		(4)

as the expected negative log-likelihood loss, the discriminator loss, and value function respectively, where $D_{\theta_{d}}(.)$ is the classifier which separates $(\hat{{\mathbf{Y}}},{\mathbf{A}},{\mathbf{Y}})$ and $(\hat{{\mathbf{Y}}},\tilde{\mathbf{A}},{\mathbf{Y}})$ , and $\mu\in[0,1]$ is a tuning parameter that controls the prediction-fairness trade-off. Then, FDL learns $\theta_{f},\theta_{d}$ by finding the minimax solution

\hat{\theta}_{f},\hat{\theta}_{d}=\arg\min_{\theta_{f}}\max_{\theta_{d}}V_{\mu}(\theta_{f},\theta_{d}).

(5)
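As a rough illustration of how (2)-(4) translate into empirical quantities, the sketch below evaluates them from arrays of model outputs. The function names and the array-based interface are assumptions made for this example; they are not taken from the FDL or FairICP code.

```python
import numpy as np

def prediction_loss(p_true_label):
    """Empirical version of (2): mean negative log-likelihood.

    p_true_label[i] is the model's probability (or density) at the observed Y_i.
    """
    return float(np.mean(-np.log(p_true_label)))

def discriminator_loss(d_real, d_fake):
    """Empirical version of (3).

    d_real[i] = D(Yhat_i, A_i, Y_i) and d_fake[i] = D(Yhat_i, A_tilde_i, Y_i),
    both assumed to lie strictly between 0 and 1.
    """
    return float(np.mean(-np.log(d_real)) + np.mean(-np.log(1.0 - d_fake)))

def value_function(p_true_label, d_real, d_fake, mu):
    """Empirical version of (4): (1 - mu) * L_f - mu * L_d."""
    return (1.0 - mu) * prediction_loss(p_true_label) - mu * discriminator_loss(d_real, d_fake)
```

When the discriminator cannot tell the real triples from the synthetic ones and outputs 1/2 everywhere, `discriminator_loss` equals log 4, which is the value the discriminator term takes at a fair solution.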

FDL generates $\tilde{A}$ through Conditional Randomization (CR) (Candès et al., 2018), that is, by re-sampling it from its (estimated) conditional distribution given the other variables that we want to control for. However, the effectiveness of conditional randomization requires estimation of $Q(A\mid Y)$, which is challenging when $A$ is multi-dimensional (Scott, 1991). This challenge is not unique to FDL and needs to be addressed for other non-resampling-based approaches such as the Holdout Randomization Test (HRT) (Tansey et al., 2022) as well. In addition, the sensitive attributes $A$ can potentially be a mix of discrete and continuous variables, which adds another layer of difficulty to estimating $Q(A\mid Y)$. An approach that allows $A$ to have flexible types and scales well with the dimension of $A$ would help promote fairness learning in many social and medical applications.

2.2 Fairness learning via ICP

To circumvent the challenge of learning the conditional density of $A$ given $Y$, we pivot to estimating $Y$ given $A$ and leverage Conditional Permutation (CP) (Berrett et al., 2020) to generate a permuted version $\tilde{{\mathbf{A}}}$ of ${\mathbf{A}}$ which also has the equalized odds property (1) asymptotically.

CP in fairness learning.

To begin with, we first introduce the vanilla CP strategy to generate permutation copies in Berrett et al. (2020) in our setting.

Let $\mathcal{S}_{n}$ denote the set of permutations on the indices $\{1,\ldots,n\}$. Given any vector $\mathbf{x}=\left(x_{1},\ldots,x_{n}\right)$ and any permutation $\pi\in\mathcal{S}_{n}$, define $\mathbf{x}_{\pi}=\left(x_{\pi(1)},\ldots,x_{\pi(n)}\right)$ as the permuted version of $\mathbf{x}$ with its entries reordered according to the permutation $\pi$. Instead of drawing a permutation $\Pi$ uniformly at random, CP assigns unequal sampling probabilities to permutations based on the conditional probability of observing ${\mathbf{A}}_{\Pi}$ given ${\mathbf{Y}}$:

\mathbb{P}\left\{\Pi=\pi\mid{\mathbf{A}},{\mathbf{Y}}\right\}=\frac{q^{n}\left({\mathbf{A}}_{\pi}\mid{\mathbf{Y}}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q^{n}\left({\mathbf{A}}_{\pi^{\prime}}\mid{\mathbf{Y}}\right)}.

(6)

Here we let $q(\cdot\mid y)$ be the density of the distribution $Q(\cdot\mid y)$ (i.e., $q(\cdot\mid y)$ is the conditional density of $A$ given $Y=y$). We write $q^{n}(\cdot\mid{\mathbf{Y}}):=q\left(\cdot\mid Y_{1}\right)\cdots q\left(\cdot\mid Y_{n}\right)$ to denote the product density. This leads to the synthetic $\tilde{\mathbf{A}}={\mathbf{A}}_{\Pi}$, which, intuitively, should have low dependence on $\hat{\mathbf{Y}}$ given ${\mathbf{Y}}$, and can thus be utilized to encourage equalized odds as described in (1).

ICP circumvents density estimation of $A\mid Y$ .

Unfortunately, conducting conditional permutation with multivariate $A$ relies on conditional density estimation of $A$ given $Y$ and does not alleviate the issue arising from multivariate density estimation as we mentioned earlier. To circumvent this problem, we propose a simple ICP (inverse conditional permutation) strategy which is indirect yet scales better with the dimensionality of $A$ and can adapt easily to various data types of $A$ .

ICP begins with the observation that the distribution of $({\mathbf{A}}_{\Pi},{\mathbf{Y}})$ is identical to the distribution of $({\mathbf{A}},{\mathbf{Y}}_{{\Pi}^{-1}})$. Hence, intuitively, instead of determining $\Pi$ based on the conditional law of $A$ given $Y$, we first consider the conditional permutation of $Y$ given $A$; since $Y$ is one-dimensional, its conditional law can be estimated conveniently using standard regression or generalized regression techniques regardless of the complexity of $A$. We then generate $\Pi$ by applying an inverse operator to the distribution of these permutations. Specifically, we generate $\tilde{\mathbf{A}}={\mathbf{A}}_{\Pi}$ with the following probabilities:

\mathbb{P}\left\{\Pi=\pi\mid{\mathbf{A}},{\mathbf{Y}}\right\}=\frac{q^{n}\left({\mathbf{Y}}_{\pi^{-1}}\mid{\mathbf{A}}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q^{n}\left({\mathbf{Y}}_{{\pi^{\prime}}^{-1}}\mid{\mathbf{A}}\right)}.

(7)

Indeed, this intuition helps us generate an $\tilde{A}$ that can be used to monitor the violation of the equalized odds condition.
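For very small $n$, the probabilities in (7) can be computed exactly by enumerating all $n!$ permutations, which may help in checking one's understanding of the indexing. The sketch below assumes a precomputed matrix `log_q` with `log_q[i, j]` equal to $\log q(Y_j \mid A_i)$ (here $q$ is the density of $Y$ given $A$); the function names are hypothetical, and this brute-force approach is not the sampler used in the paper.

```python
import itertools
import numpy as np

def icp_permutation_probs(log_q):
    """Exact probabilities in (7) for tiny n, given log_q[i, j] = log q(Y_j | A_i)."""
    n = log_q.shape[0]
    perms = list(itertools.permutations(range(n)))
    cols = np.arange(n)
    # prod_i q(Y_{pi^{-1}(i)} | A_i) = prod_j q(Y_j | A_{pi(j)}): each Y_j is
    # scored against the attribute A_{pi(j)} it is paired with after permuting.
    log_w = np.array([log_q[list(p), cols].sum() for p in perms])
    probs = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return perms, probs / probs.sum()

def sample_icp(A, log_q, rng):
    """Draw Pi from (7) and return A_tilde = A_Pi (feasible only when n! is small)."""
    perms, probs = icp_permutation_probs(log_q)
    pi = perms[rng.choice(len(perms), p=probs)]
    return A[list(pi)]
```

If $Y$ carries no information about $A$ (constant `log_q`), every permutation is equally likely; if $Y$ determines $A$ almost exactly, the identity permutation dominates and $\tilde{\mathbf{A}}$ stays close to ${\mathbf{A}}$.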

Theorem 2.1.

For any $n$ i.i.d. observations $({\mathbf{X}},{\mathbf{A}},{\mathbf{Y}})$, let $\tilde{\mathbf{A}}$ be generated by the ICP sampling scheme (7). Let $S({\mathbf{A}})$ denote the unordered set of rows in ${\mathbf{A}}$, and let $p$ be the dimension of $A$. We have

(1) If $\hat{Y} \perp\!\!\!\perp A\mid Y$, then $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\,{\buildrel d\over{=}}\,(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$.

(2) If $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\,{\buildrel d\over{=}}\,(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$, then $\hat{\mathbf{Y}} \perp\!\!\!\perp {\mathbf{A}}\mid({\mathbf{Y}}\mbox{ and }S({\mathbf{A}}))$. Further, when $\frac{\log p}{n}\rightarrow 0$, the asymptotic equalized odds condition holds: for any constant vectors $t_{1}$ and $t_{2}$,

\mathbb{P}\left[\hat{Y}\leq t_{1},A\leq t_{2}\mid Y\right]-\mathbb{P}\left[\hat{Y}\leq t_{1}\mid Y\right]\mathbb{P}\left[A\leq t_{2}\mid Y\right]\overset{n\rightarrow\infty}{\rightarrow}0.

Remark 2.2.

In FDL, the availability of an accurate conditional density of $A|Y$ enables the equivalence between $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\,{\buildrel d\over{=}}\,(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ and $\hat{\mathbf{Y}} \perp\!\!\!\perp {\mathbf{A}}\mid{\mathbf{Y}}$. ICP pays an almost negligible price: it offers a fast-rate asymptotic equivalence while circumventing the density estimation of $A\mid Y$.

Motivated by this, we propose an adversarial learning procedure utilizing the permuted sensitive attributes $\tilde{A}$ from the ICP sampling scheme (7), which is built under the same formulation of the loss function shown previously in Section 2.1. Let $\hat{\mathcal{L}}_{f}(\theta_{f})$ and $\hat{\mathcal{L}}_{d}(\theta_{f},\theta_{d})$ be the empirical realizations of the losses $\mathcal{L}_{f}(\theta_{f})$ , and $\mathcal{L}_{d}(\theta_{f},\theta_{d})$ defined in (2) and (3) respectively. Algorithm 1 presents the details. We detail the permutation sampling algorithm, Parallelized pairwise sampler, in Appendix B for the sake of completeness, which is adapted from Berrett et al. (2020).
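Enumerating permutations is infeasible beyond tiny $n$, which is why a Markov chain sampler is used in practice. The following is only a sketch in the spirit of the pairwise sampler of Berrett et al. (2020), adapted here to the ICP weights in (7); it is not a reproduction of Appendix B. It assumes a matrix `log_q` with `log_q[i, j]` equal to $\log q(Y_j \mid A_i)$, and the function name, the number of steps, and the clipping constant are illustrative choices.

```python
import numpy as np

def pairwise_icp_sampler(log_q, n_steps=50, seed=None):
    """Approximate draw of Pi from (7) via random disjoint pairwise swaps.

    Returns an index array perm such that A_tilde = A[perm].
    """
    rng = np.random.default_rng(seed)
    n = log_q.shape[0]
    perm = np.arange(n)  # start from the identity, i.e. A_tilde = A
    half = n // 2
    for _ in range(n_steps):
        order = rng.permutation(n)
        j, k = order[:half], order[half:2 * half]  # disjoint random pairs (j, k)
        # Log-weight of the current pairing versus the pairing with entries swapped.
        keep = log_q[perm[j], j] + log_q[perm[k], k]
        swap = log_q[perm[k], j] + log_q[perm[j], k]
        log_odds = np.clip(swap - keep, -50.0, 50.0)  # clip to avoid overflow in exp
        # Swap with probability w_swap / (w_swap + w_keep); this rule satisfies
        # detailed balance with respect to the weights in (7).
        do_swap = rng.random(half) < 1.0 / (1.0 + np.exp(-log_odds))
        j, k = j[do_swap], k[do_swap]
        perm[j], perm[k] = perm[k].copy(), perm[j].copy()
    return perm
```

Because the weight in (7) factorizes over positions, the disjoint pairs can be updated simultaneously, which is what makes a vectorized (parallel) update possible. The chain is started at the identity here for simplicity, so the output is only approximately distributed according to (7) after finitely many steps.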

Theorem 2.3.

If there exists a minimax solution $(\hat{\theta}_{f},\hat{\theta}_{d})$ for $V_{\mu}\left(.,.\right)$ defined in (5) such that $V_{\mu}(\hat{\theta}_{f},\hat{\theta}_{d})=(1-\mu)H(Y\mid X)-\mu\log(4)$, where $H(Y\mid X)=\mathbb{E}_{XY}\left[-\log p(Y\mid X)\right]$ denotes the conditional entropy, then $\hat{f}_{\hat{\theta}_{f}}(\cdot)$ is both an optimal and fair predictor, which simultaneously minimizes $\mathcal{L}_{f}\left(\theta_{f}\right)$ and satisfies equalized odds.
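The constant $\mu\log(4)$ can be traced with the standard GAN calculation (Goodfellow et al., 2014). The following is a sketch rather than the paper's proof: it writes $P$ for the law of $(\hat{Y},A,Y)$ and $\tilde{P}$ for the law of $(\hat{Y},\tilde{A},Y)$ (notation introduced only for this remark), and assumes the discriminator class is rich enough to contain the pointwise-optimal discriminator.

```latex
% Inner problem: maximizing V_mu over theta_d means minimizing L_d.
\min_{D}\ \mathcal{L}_{d}
  \;=\; \log 4 \;-\; 2\,\mathrm{JSD}\!\left(P \,\|\, \tilde{P}\right)
  \;\le\; \log 4,
\qquad \text{with equality iff } P=\tilde{P}\ \ (\text{where } D\equiv 1/2).

% Outer problem: the cross-entropy is bounded below by the conditional entropy.
\max_{\theta_{d}} V_{\mu}(\theta_{f},\theta_{d})
  \;=\; (1-\mu)\,\mathcal{L}_{f}(\theta_{f}) \;-\; \mu\log 4
        \;+\; 2\mu\,\mathrm{JSD}\!\left(P \,\|\, \tilde{P}\right)
  \;\ge\; (1-\mu)\,H(Y\mid X) \;-\; \mu\log 4 .
```

Under these assumptions, the lower bound is attained only when the prediction loss equals $H(Y\mid X)$ and $P=\tilde{P}$, which is the situation the theorem describes.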

Input: Data $({\mathbf{X}},{\mathbf{A}},{\mathbf{Y}})=\{(X_{i},A_{i},Y_{i})\}_{i\in\mathcal{I}_{\mathrm{tr}}}$

Parameters: penalty weight $\mu$ , step size $\alpha$ , number of gradient steps $N_{g}$ , and iterations $T$ .

Output: predictive model $\hat{f}_{\hat{\theta}_{f}}(\cdot)$ and discriminator $\hat{D}_{\hat{\theta}_{d}}(\cdot)$ .

1: for $t=1,\dots,T$

2:   Generate permuted copy $\tilde{\mathbf{A}}$ by (7) (using the sampler described in Appendix B)

3:   Update the discriminator parameters $\theta_{d}$ by repeating the following for $N_{g}$ gradient steps: $\theta_{d}\leftarrow\theta_{d}-\alpha\nabla_{\theta_{d}}\hat{\mathcal{L}}_{d}(\theta_{f},\theta_{d})$.

4:   Update the predictive model parameters $\theta_{f}$ by repeating the following for $N_{g}$ gradient steps: $\theta_{f}\leftarrow\theta_{f}-\alpha\nabla_{\theta_{f}}\left[(1-\mu)\hat{\mathcal{L}}_{f}(\theta_{f})-\mu\hat{\mathcal{L}}_{d}(\theta_{f},\theta_{d})\right]$.

5: end for

Output: Predictive model $\hat{f}_{\hat{\theta}_{f}}(\cdot)$ .

Algorithm 1 Fairness learning via ICP

In practice, the assumption of the existence of an optimal and fair predictor (in terms of equalized odds) may not hold (Tang and Zhang, 2022). Setting $\mu$ to a large value will preferably enforce $f$ to satisfy equalized odds while setting $\mu$ close to 0 will push $f$ to be optimal: an increase in accuracy would often be accompanied by a decrease in fairness and vice-versa.

2.3 Density Estimation

The estimation of conditional densities is a crucial part of both our method and previous work (Romano et al., 2020; Mary et al., 2019). However, unlike the previous work which requires the estimation of $A\mid Y$ , our proposal looks into the inverse relationship of $Y\mid A$ . In practice, our proposed method can easily leverage the state-of-the-art density estimator and is less disturbed by the increased complexity in $A$ , due to either dimension or data types.

In this manuscript, we apply Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) to estimate the conditional density of $Y|A$ when $Y$ is continuous and $A_{1},\ldots,A_{k}$ can take arbitrary data types (discrete or continuous). (Footnote 1: In the MAF paper (Papamakarios et al., 2017), to estimate $p(U\mid V)$, $U$ is assumed to be continuous while $V$ can take arbitrary form, and there are no requirements on the dimensionality of $U$ and $V$.) In the classification scenario where $Y\in\{0,1,\ldots,L\}$, one can always fit a classifier to model $Y|A$. Consequently, FairICP is better suited to handling complex sensitive attributes and is applicable to both regression and classification tasks. To provide more theoretical and empirical insight into how the quality of density estimation affects CP and ICP, we include additional analysis in Appendix C.

3 Measuring the violation of equalized odds

To gain a reliable understanding of the potential violation of equalized odds by the trained model $\hat{f}$, we carry out a disciplined evaluation utilizing an untouched test set $({\mathbf{X}}^{te},{\mathbf{A}}^{te},{\mathbf{Y}}^{te})=\{(X_{i},A_{i},Y_{i})\}_{i\in\mathcal{I}_{te}}$ and a recently proposed conditional independence measure.

3.1 Measure of Conditional Dependence

From a statistical point of view, we note that equalized odds (1) is exactly the notion of conditional independence. Thus, measuring the violation of equalized odds is equivalent to measuring conditional independence, and there have been some works trying to bridge these two problems (Mary et al., 2019; Kamishima et al., 2011; Romano et al., 2020).

In Mary et al. (2019), the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient (HGR) is chosen to measure the conditional dependence for equalized odds and is used as a penalty term to fit a fair model. However, the estimation of HGR, which is based on kernel density estimation of $A$, becomes difficult when $A$ is multivariate. Here, we take advantage of recent developments in conditional dependence measures and link them to our problem by introducing a flexible measure proposed by Huang et al. (2022).

Definition 3.1.

Kernel Partial Correlation (KPC) coefficient $\rho^{2}\equiv\rho^{2}(U,V\mid W)$ is defined as:

\rho^{2}(U,V\mid W):=\frac{\mathbb{E}\left[\operatorname{MMD}^{2}\left(P_{U\mid WV},P_{U\mid W}\right)\right]}{\mathbb{E}\left[\operatorname{MMD}^{2}\left(\delta_{U},P_{U\mid W}\right)\right]},

where $(U,V,W)\sim P$ and $P$ is supported on a subset of some topological space $\mathcal{U}\times\mathcal{V}\times\mathcal{W}$ , MMD is the maximum mean discrepancy - a distance metric between two probability distributions depending on the characteristic kernel $k(\cdot,\cdot)$ and $\delta_{U}$ denotes the Dirac measure at $U$ .
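The building block of Definition 3.1 is the squared MMD between two distributions. As a hedged illustration of that ingredient only (this is not the graph-based KPC estimator of Huang et al. (2022)), the sketch below computes the standard biased (V-statistic) estimate of $\operatorname{MMD}^{2}$ between two samples using a Gaussian kernel. The function names and the bandwidth default are assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x (n, d) and y (m, d)."""
    sq = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x and y."""
    return float(
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )
```

The estimate is zero when the two samples coincide and grows as the samples separate, which is the behavior the numerator of $\rho^{2}$ relies on when comparing $P_{U\mid WV}$ with $P_{U\mid W}$.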

Under mild regularity conditions (see details in Huang et al. (2022)), $\rho^{2}$ satisfies several good properties for any joint distribution of $(U,V,W)$ in Definition 3.1: (1) $\rho^{2}\in[0,1]$; (2) $\rho^{2}=0$ if and only if $U \perp\!\!\!\perp V\mid W$; (3) $\rho^{2}=1$ if and only if $U$ is a measurable function of $V$ given $W$. A consistent estimator $\hat{\rho}^{2}$, computed by geometric graph-based methods (Section 3 in Huang et al. (2022)), is also provided in the R package KPC.

With the aid of KPC, we can rigorously quantify the violation of equalized odds by estimating $\rho^{2}(\hat{Y},A\mid Y)$ , where $A$ can take arbitrary form and response $Y$ can be continuous (regression) or categorical (classification).

3.2 Hypothesis test for equalized odds

We also provide a formal hypothesis test with a statistical guarantee to detect any violation of equalized odds. Our hypothesis test once again uses the permuted version $\tilde{\mathbf{A}}$ of ${\mathbf{A}}$ and implements a conditional independence test. The idea is that we repeatedly generate fake copies $\tilde{\mathbf{A}}$ by (7), and by Theorem 2.1, $(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ will have the same distribution as $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})$ under the assumption of equalized odds (1). Therefore, we can use any test statistic $T$ to obtain a valid hypothesis test, since $T(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ will also have the same distribution as $T(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})$ under the assumption of equalized odds. The procedure of our proposed hypothesis test is given in Algorithm 2.

Proposition 3.2.

Suppose the test observations $({\mathbf{X}}^{te},{\mathbf{A}}^{te},{\mathbf{Y}}^{te})=\{(X_{i},A_{i},Y_{i}): 1\leq i\leq n_{\mathrm{te}}\}$ are i.i.d., and let $\hat{\mathbf{Y}}^{te}=\{\hat{f}(X_{i}): 1\leq i\leq n_{\mathrm{te}}\}$ for a learned model $\hat{f}$ (not necessarily trained with our proposed method). If $H_{0}:\hat{\mathbf{Y}}^{te} \perp\!\!\!\perp {\mathbf{A}}^{te}\mid{\mathbf{Y}}^{te}$, i.e., equalized odds (1) holds, then the output p-value $p_{v}$ of Algorithm 2 is valid, satisfying $\mathbb{P}\{p_{v}\leq\alpha\}\leq\alpha$ for any desired Type I error rate $\alpha\in[0,1]$.

Input: Data $(\hat{\mathbf{Y}}^{te},{\mathbf{A}}^{te},{\mathbf{Y}}^{te})=\{(\hat{Y}_{i},A_{i},Y_{i})\}$, $1\leq i\leq n_{\mathrm{te}}$

Parameter: the number of synthetic copies $K$ .

1: Compute the test statistic $T$ on the test set: $t^{*}=T(\hat{\mathbf{Y}}^{te},{\mathbf{A}}^{te},{\mathbf{Y}}^{te})$

2: for $k=1,\dots,K$

3:   Generate permuted copy $\tilde{\mathbf{A}}_{k}$ of ${\mathbf{A}}^{te}$ by (7) (using the sampler described in Appendix B)

4:   Compute the test statistic $T$ using the fake copy on the test set: $t^{(k)}=T(\hat{\mathbf{Y}}^{te},\tilde{\mathbf{A}}_{k},{\mathbf{Y}}^{te})$

5: end for

6: Compute the $p$-value: $p_{v}=\frac{1}{K+1}\left(1+\sum_{k=1}^{K}\mathbb{I}\left[t^{*}\geq t^{(k)}\right]\right)$

Output: A $p$ -value $p_{v}$ for the hypothesis that equalized odds (1) holds.

Algorithm 2 ICP Test for Equalized Odds
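The final step of Algorithm 2 is a standard Monte Carlo permutation p-value, sketched below. The function name and the `small_is_extreme` flag are introduced here for illustration. With the inequality as printed, $t^{*}\geq t^{(k)}$, small values of $T$ count as evidence against the null, which suits a risk-type statistic; for a statistic such as KPC, where larger values indicate more dependence, one would presumably count $t^{(k)}\geq t^{*}$ instead. That reading is an assumption about intent, not something the text states.

```python
import numpy as np

def icp_p_value(t_star, t_null, small_is_extreme=True):
    """Monte Carlo p-value from the observed statistic and K null copies.

    small_is_extreme=True follows the inequality as printed in Algorithm 2;
    False flips it for statistics where large values signal dependence.
    """
    t_null = np.asarray(t_null, dtype=float)
    if small_is_extreme:
        count = np.sum(t_star >= t_null)  # inequality as printed in Algorithm 2
    else:
        count = np.sum(t_star <= t_null)  # flipped direction (assumed alternative)
    return (1 + count) / (len(t_null) + 1)
```

Either direction yields a valid p-value under the null, since validity only requires that the observed statistic and the null copies be exchangeable; the direction matters for power.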

We note that a similar hypothesis test for equalized odds is proposed in Romano et al. (2020); it uses a resampled version $\tilde{A}$ of $A$ and chooses $T$ in Algorithm 2 as in the Holdout Randomization Test (Tansey et al., 2022), which is based on a predictor $\hat{r}(A,Y)$ aiming to predict $\hat{Y}$ and is formulated as the empirical risk (e.g., mean squared error). However, the $T(\hat{Y},A,Y)$ chosen in Tansey et al. (2022) cannot itself serve as an accurate dependence measure the way KPC does.

4 Experiments

In this section, we conduct numerical experiments to examine the effectiveness of the proposed approach on both synthetic datasets and real datasets. The code is available at https://github.com/yuhenglai/FairICP. All the details are included in Appendix D.

4.1 Experiments on synthetic datasets

4.1.1 Synthetic data generation

In this section, we explore the performance of FairICP in simulations with a continuous response $Y\in{\mathbb{R}}$, where potentially multiple sensitive attributes enter through one of two mechanisms:

• Simulation 1: The response $Y$ depends on two sets of features $X^{*}\in{\mathbb{R}}^{K}$ and $X^{\prime}\in{\mathbb{R}}^{K}$:

	$\displaystyle Y\sim\mathcal{N}\left(\sum_{k=1}^{K}X_{k}^{*}+\sum_{k=1}^{K}X_{k}^{\prime},\sigma^{2}\right),$
	$\displaystyle X_{1:K}^{*}\sim\mathcal{N}(\sqrt{w}A_{1:K},(1-w)\mathbf{I}_{K}),$		(Sim1)
	$\displaystyle X_{1:K}^{\prime}\sim\mathcal{N}(\mathbf{0}_{K},\mathbf{I}_{K}),$

• Simulation 2: The response $Y$ depends on two features $X^{*}\in{\mathbb{R}}$ and $X^{\prime}\in{\mathbb{R}}$:

	$\displaystyle Y\sim\mathcal{N}\left(X^{*}+X^{\prime},\sigma^{2}\right),$
	$\displaystyle X^{*}\sim\mathcal{N}(\sqrt{w}A_{1},1-w),$		(Sim2)
	$\displaystyle X^{\prime}\sim\mathcal{N}(0,1),$

• $Y$ is influenced by multiple sensitive attributes $A_{1:K}$ in the setting Sim1 and by a sole sensitive attribute $A_{1}$ in the setting Sim2. The parameter $w\in[0,1]$ controls the dependence of the predictive feature on $A$, and we consider $w=0.9$ as a high dependence scenario and $w=0.6$ as a low dependence scenario in our experiments.

• In both settings, all sensitive attributes are generated independently from a mixture of Gamma distributions to increase the difficulty of estimating $A$:

A_{k}\sim\frac{1}{2}\mathrm{Gamma}(1,1)+\frac{1}{2}\mathrm{Gamma}(1,10),

where $k=1,\ldots,K$ for setting Sim1 and $k=1,\ldots,K+1$ for setting Sim2.
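The two generating mechanisms can be sketched in a few lines of NumPy. Several details below are assumptions, since the text does not pin them down: $\mathrm{Gamma}(1,10)$ is read as shape 1 and scale 10 (a rate parameterization would give a different distribution), $\sigma$ defaults to 1, and the function names and the returned feature layout are illustrative.

```python
import numpy as np

def sample_sensitive(n, dim, rng):
    """Equal-weight mixture of Gamma(1, 1) and Gamma(1, 10), independent per coordinate.

    Assumption: the second argument is a scale parameter (NumPy's convention).
    """
    first = rng.random((n, dim)) < 0.5
    return np.where(first, rng.gamma(1.0, 1.0, (n, dim)), rng.gamma(1.0, 10.0, (n, dim)))

def simulate(n, K, w, sigma=1.0, setting="sim1", seed=0):
    """Draw (X, A, Y) following (Sim1) or (Sim2); sigma = 1 is an assumed default."""
    rng = np.random.default_rng(seed)
    if setting == "sim1":
        A = sample_sensitive(n, K, rng)          # all K attributes drive X*
        X_star = np.sqrt(w) * A + np.sqrt(1.0 - w) * rng.standard_normal((n, K))
        X_prime = rng.standard_normal((n, K))
    elif setting == "sim2":
        A = sample_sensitive(n, K + 1, rng)      # only A_1 drives X*; the rest are noise
        X_star = np.sqrt(w) * A[:, :1] + np.sqrt(1.0 - w) * rng.standard_normal((n, 1))
        X_prime = rng.standard_normal((n, 1))
    else:
        raise ValueError("setting must be 'sim1' or 'sim2'")
    Y = X_star.sum(axis=1) + X_prime.sum(axis=1) + sigma * rng.standard_normal(n)
    X = np.hstack([X_star, X_prime])
    return X, A, Y
```

For example, `simulate(500, 10, 0.9, setting="sim2")` would produce a training set of the size used in the experiments, with one informative attribute and ten noise attributes.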

We compare the proposed method FairICP to FDL and an oracle version of FairICP where $Y|A$ is given as the true conditional density. These synthetic experiments are where we can reliably evaluate the violation of the equalized odds condition of different methods. We are interested in 1) investigating if FairICP is more effective than FDL as the number of noisy attributes increases (increased $K$ ) by considering the easier problem of estimating the density of $Y|A$ rather than $A|Y$ ; and 2) evaluating if KPC is a good measure for conditional dependence in the sense that it can capture the relative degree of violation of equalized odds when applying different methods to the same data sets.

4.1.2 Results on synthetic datasets

We compare FairICP (with $P(Y\mid A)$ estimated by MAF (Papamakarios et al., 2017)), FDL (with $P(A\mid Y)$ estimated by MAF), and the oracle version of FairICP with the true density. To measure the violation of equalized odds, we calculate the empirical KPC $=\hat{\rho}^{2}(\hat{Y},A\mid Y)$ using the R package KPC with a Gaussian kernel and default parameters (Huang et al., 2022). Apart from the KPC measure itself, we also consider a second evaluation metric based on the hypothesis test outlined in Algorithm 2 with $T=KPC$ , where the power of rejecting the null hypothesis at level $0.05$ , computed using the underlying true conditional density, serves as a measure of conditional dependence. A greater $\hat{\rho}^{2}$ or rejection power indicates stronger conditional dependence between $A$ and $\hat{Y}$ given $Y$ . Note that in Sim2 only $A_{1}$ influences $Y$ , so the test is based on $\hat{\rho}^{2}(\hat{Y},A_{1}\mid Y)$ to exclude the effects of noise (though the training is based on $A_{1:K+1}$ for all methods to demonstrate the performance under noise).
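As a rough sketch of how the power measure could be computed from repeated runs: the helper names below are hypothetical, and the $p$ -value uses the standard permutation form $(1+\#\{T_{b}\geq T\})/(1+B)$ , which is an assumption here since Algorithm 2 is not reproduced in this section.

```python
import numpy as np

def permutation_p_value(t_obs, t_null):
    """Standard permutation p-value (assumed form, not taken from Algorithm 2).

    t_obs  : statistic on the observed data, e.g. KPC(Y_hat, A | Y)
    t_null : statistics recomputed with resampled copies of A
    """
    t_null = np.asarray(t_null, dtype=float)
    return (1.0 + np.sum(t_null >= t_obs)) / (1.0 + t_null.size)

def rejection_power(p_values, level=0.05):
    """Fraction of independent runs in which the null is rejected."""
    return float(np.mean(np.asarray(p_values) <= level))

p = permutation_p_value(3.0, [1.0, 2.0, 4.0])
power = rejection_power([0.01, 0.20, 0.05, 0.50])
```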

Figures 2 and 3 show the trade-off curves between prediction loss and the degree of fairness violation, measured by KPC or by the associated fairness testing power from Algorithm 2 with $T=KPC$ , under settings Sim1 and Sim2 respectively, with $K\in\{1,5,10\}$ under the high-dependence scenario $w=0.9$ (results with low dependence on $A$ are shown in Appendix D.1). We implement $f$ as a linear model and $d$ as a neural network, and all methods being compared are trained with different penalty parameters $\mu\in[0,1]$ to show the trade-off. In both simulations, the Pareto fronts are based on 100 independent runs with a sample size of 500 for the training set and 400 for the test set.
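For readers unfamiliar with Pareto fronts, the following sketch shows one way to extract the non-dominated (loss, violation) pairs from a collection of fitted models. The function name `pareto_front` is hypothetical, and how results are aggregated over the 100 runs before this step is not specified here, so that part is left out.

```python
import numpy as np

def pareto_front(loss, violation):
    """Return indices of points not dominated in (loss, violation).

    A point is kept if no other point has both lower-or-equal loss
    and strictly lower violation among those sorted before it.
    """
    pts = np.column_stack([loss, violation]).astype(float)
    # lexsort uses the last key as primary: sort by loss, then by violation
    order = np.lexsort((pts[:, 1], pts[:, 0]))
    front, best_violation = [], np.inf
    for i in order:
        if pts[i, 1] < best_violation:
            front.append(int(i))
            best_violation = pts[i, 1]
    return front

# Hypothetical (prediction loss, KPC) pairs for four values of mu
front = pareto_front([1.0, 2.0, 3.0, 2.5], [0.9, 0.5, 0.6, 0.4])
```

Here the third point is dropped because the fourth has both lower loss and lower violation.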

Figure 2 shows the results from setting Sim1. Models from all three methods reduce to a plain linear regression without regard to fairness when $\mu=0$ , resulting in low prediction loss but a severe violation of equalized odds (evidenced by large KPC and statistical power); as $\mu$ increases, models pay more attention to fairness (lower KPC and power) at the cost of higher prediction loss. FairICP (proposed) performs very close to the oracle model while outperforming FDL as $K$ gets larger under both the KPC measure and the power measure, which matches our expectation and follows from the increased difficulty of estimating the conditional density of $A|Y$ . When the dimension of $A$ is 10, which is already large compared to what is examined in the current literature, FairICP shows a noticeable performance reduction relative to the oracle model as measured by KPC, though this reduction is still smaller than that of FDL. Of note, this slight difference does not show up when measured by the power, likely due to the information lost when dichotomizing the continuous KPC measure into a 0-1 decision at the $p$ -value cutoff.

Figure 3 shows the results from setting Sim2 and delivers a similar message to Figure 2. The gaps between FairICP and FDL are wider than in Figure 2 as $K$ increases, which reflects the smaller share of information about $A$ needed for estimating $Y|A$ in setting Sim2.

Note that the power measure depends on how the permutation/sampling is conducted in practice: its reliability hinges on the correctness of the sampling scheme and thus on the accuracy of density estimation. In contrast, the direct KPC (kernel partial correlation) measure is independent of density estimation. We can trust the power evaluation in our synthetic experiments because we have used the true conditional density. The consistency between the KPC measures and the power measures in our synthetic experiments suggests that KPC is a reasonable and density-estimation-free measure in real applications for comparing different learning methods.

Results for the case where $X$ has a weaker dependence on $A$ ( $w=0.6$ ) are in Appendix D.1; they convey the same message as Figures 2 and 3, with larger discrepancies across methods.

4.2 Real-data experiments

We consider real-world cases where we may need to protect more than one sensitive attribute. For all the experiments, we split the data into a training set (60%) and a test set (40%), and all the results shown are based on the test set.

4.2.1 Fair regression

In the Communities and Crime dataset (available at the UC Irvine Data Repository, http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime), each record describes the aggregate demographic properties of a different U.S. community; the data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. The total number of records is 1994, and the number of features is 122. Our task here is to predict the number of violent crimes per population for US cities while protecting all race information, to avoid biasing police checks depending on the ethnic characteristics of the community. Specifically, we take the information on three minority races in this dataset (African American, Hispanic, Asian) as sensitive attributes, instead of only one race as done in the previous literature. We also consider the case where $A$ only includes one race (African American), as in Romano et al. (2020); Mary et al. (2019), for better comparison. All the sensitive attributes used here are continuous, representing the percentage of the population of a given race.

We compare our proposed method with FDL (Romano et al., 2020) and HGR (Mary et al., 2019). Since the implementation of Mary et al. (2019) does not directly apply to multiple sensitive attributes, we use the mean of the three per-attribute HGR coefficients as the penalty. Note that we do not include sensitive attributes as features in our experiments, as in Romano et al. (2020); Mary et al. (2019). We consider neural networks as the predictor $f$ in all methods (we also consider $f$ as a linear model in Appendix D.2), and we tune the hyperparameters as in Romano et al. (2020) (see details in Appendix D).

We present our results as Pareto fronts in Figure 4 to show the trade-off curves of prediction and fairness given by our method and the state-of-the-art methods, where fairness is measured by both KPC and the power of the statistical test for fairness outlined in Algorithm 2 with $T$ chosen as KPC. Both metrics give similar trends: although there are some small discrepancies between KPC and the fairness test, FairICP outperforms FDL and HGR, especially when all three sensitive attributes are considered. Although the conditional density is now estimated and the fairness test might suffer from it, KPC is a robust measure regardless of the sampling scheme for $\tilde{A}$ .

4.2.2 Fair classification

We then turn to a binary classification case that has been well studied and involves two categorical sensitive attributes. The dataset we consider is ProPublica’s COMPAS recidivism data (5278 examples). Although it is widely used in the fairness-related literature, there have recently been critiques of the limitations of this dataset (Bao et al., 2022). The task is to predict recidivism from someone’s criminal history, jail and prison time, demographics, and COMPAS risk scores. We choose two binary protected attributes $A$ : race (white vs. non-white) and sex. For this special task (binary classification against multiple binary sensitive attributes), we compare FairICP to two baselines, HGR (Mary et al., 2019) and exponentiated-gradient reduction (Agarwal et al., 2018), with the latter developed for this particular kind of task. We aim to use this example to demonstrate the ability of FairICP to handle categorical observations and to provide performance comparable to the more tailored approach.

In addition to KPC and the corresponding fairness test, we consider another fairness metric based on the confusion matrix (Hardt et al., 2016; Cho et al., 2020), designed for such a binary classification task with categorical sensitive attributes to quantify equalized odds:

\displaystyle\mathrm{DEO}:=\sum_{y\in\{0,1\}}\sum_{z\in\mathcal{Z}}\left|\operatorname{Pr}(\hat{Y}=1\mid Z=z,Y=y)-\operatorname{Pr}(\hat{Y}=1\mid Y=y)\right|,

(8)

where $\hat{Y}$ is the predicted class label.
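The DEO metric can be computed from predicted labels with a few lines of NumPy. The sketch below is illustrative only: the function name `deo` is hypothetical, and it assumes that $Z$ is encoded as a single group label per sample (for two binary attributes one could, for instance, use `2 * race + sex`), which is an encoding choice not stated in the text.

```python
import numpy as np

def deo(y_true, y_pred, groups):
    """Sum over y and z of |Pr(Yhat=1 | Z=z, Y=y) - Pr(Yhat=1 | Y=y)|.

    y_true, y_pred : arrays of 0/1 labels
    groups         : array of group labels z (assumed single-label encoding)
    """
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    total = 0.0
    for y in (0, 1):
        in_class = y_true == y
        if not in_class.any():
            continue                      # no samples with this true label
        base_rate = y_pred[in_class].mean()
        for z in np.unique(groups):
            cell = in_class & (groups == z)
            if cell.any():                # skip empty (z, y) cells
                total += abs(y_pred[cell].mean() - base_rate)
    return total

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 0]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
value = deo(y_true, y_pred, groups)
```

In this toy example the false positive rates agree across groups while the true positive rates are 1 and 0, so each group contributes 0.5 for $y=1$ .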

Similar to the regression case, we train neural network models as classifiers and discriminators (we also consider $f$ as a linear model in Appendix D.2; see details in Appendix D).

Figure 5 shows that all three methods behave similarly overall in this classification example regarding their prediction-fairness trade-offs, with FairICP closely matching the performance of the exponentiated-gradient reduction (referred to as Reduction) under all three fairness evaluation metrics, and HGR slightly worse than FairICP and Reduction when measured by DEO.

5 Discussion

We introduced a flexible fairness learning approach, FairICP, to address the challenge of achieving equalized-odds fairness with complex sensitive attributes. FairICP combines adversarial learning with a novel inverse conditional permutation (ICP) strategy and offers a flexible and effective solution for handling sensitive attributes that may be multidimensional and of mixed data types. We provided theoretical insights into the proposed method, elucidating the underlying concepts and the rationale behind integrating ICP with adversarial learning. Furthermore, we conducted numerical experiments on both synthetic and real data to support our theoretical insights and demonstrate the efficacy and flexibility of our proposed method. In our experience, the majority of the computational burden of FairICP lies in training the adversarial prediction model (as also mentioned in Zhang et al. (2018); Romano et al. (2020)), with the cost of density estimation and ICP sampling being negligible in comparison. The scalability challenge of adversarial techniques should be addressed more carefully by implementing more efficient methods, which we view as a future direction for improving FairICP.

References

Agarwal et al. (2018) Agarwal, A., A. Beygelzimer, M. Dudík, J. Langford, and H. Wallach (2018). A reductions approach to fair classification. In International conference on machine learning, pp. 60–69. PMLR.
Bao et al. (2022) Bao, M., A. Zhou, S. Zottola, B. Brubach, S. Desmarais, A. Horowitz, K. Lum, and S. Venkatasubramanian (2022). It’s compaslicated: The messy relationship between rai datasets and algorithmic fairness benchmarks.
Berrett et al. (2020) Berrett, T. B., Y. Wang, R. F. Barber, and R. J. Samworth (2020). The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society Series B: Statistical Methodology 82(1), 175–197.
Candès et al. (2018) Candès, E., Y. Fan, L. Janson, and J. Lv (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B 80(3), 551–577.
Castelnovo et al. (2022) Castelnovo, A., R. Crupi, G. Greco, D. Regoli, I. G. Penco, and A. C. Cosentini (2022). A clarification of the nuances in the fairness metrics landscape. Scientific Reports 12(1), 4209.
Cho et al. (2020) Cho, J., G. Hwang, and C. Suh (2020). A fair classifier using kernel density estimation. Advances in neural information processing systems 33, 15088–15099.
Creager et al. (2019) Creager, E., D. Madras, J.-H. Jacobsen, M. Weis, K. Swersky, T. Pitassi, and R. Zemel (2019, 09–15 Jun). Flexibly fair representation learning by disentanglement. In K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, Volume 97 of Proceedings of Machine Learning Research, pp. 1436–1445. PMLR.
Dwork et al. (2012) Dwork, C., M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
Feldman et al. (2015) Feldman, M., S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 259–268.
Goodfellow et al. (2014) Goodfellow, I., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
Hardt et al. (2016) Hardt, M., E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning. Advances in neural information processing systems 29.
Hebert-Johnson et al. (2018) Hebert-Johnson, U., M. Kim, O. Reingold, and G. Rothblum (2018, 10–15 Jul). Multicalibration: Calibration for the (Computationally-identifiable) masses. In J. Dy and A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research, pp. 1939–1948. PMLR.
Huang et al. (2022) Huang, Z., N. Deb, and B. Sen (2022). Kernel partial correlation coefficient — a measure of conditional dependence. Journal of Machine Learning Research 23(216), 1–58.
Kamishima et al. (2011) Kamishima, T., S. Akaho, and J. Sakuma (2011). Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650. IEEE.
Kearns et al. (2018) Kearns, M., S. Neel, A. Roth, and Z. S. Wu (2018, 10–15 Jul). Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy and A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, Volume 80 of Proceedings of Machine Learning Research, pp. 2564–2572. PMLR.
Keener (2010) Keener, R. W. (2010). Theoretical statistics: Topics for a core course. Springer.
Kim et al. (2018) Kim, M. P., A. Ghorbani, and J. Zou (2018). Multiaccuracy: Black-box post-processing for fairness in classification.
Kusner et al. (2017) Kusner, M. J., J. Loftus, C. Russell, and R. Silva (2017). Counterfactual fairness. In Advances in Neural Information Processing Systems 30, pp. 4066–4076.
Mary et al. (2019) Mary, J., C. Calauzenes, and N. El Karoui (2019). Fairness-aware learning for continuous attributes and treatments. In International Conference on Machine Learning, pp. 4382–4391.
Mehrabi et al. (2021) Mehrabi, N., F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan (2021). A survey on bias and fairness in machine learning. ACM computing surveys (CSUR) 54(6), 1–35.
Naaman (2021) Naaman, M. (2021). On the tight constant in the multivariate dvoretzky–kiefer–wolfowitz inequality. Statistics & Probability Letters 173, 109088.
Papamakarios et al. (2017) Papamakarios, G., T. Pavlakou, and I. Murray (2017). Masked autoregressive flow for density estimation. Advances in neural information processing systems 30.
Romano et al. (2020) Romano, Y., S. Bates, and E. Candes (2020). Achieving equalized odds by resampling sensitive attributes. Advances in neural information processing systems 33, 361–371.
Scott (1991) Scott, D. W. (1991). Feasibility of multivariate density estimates. Biometrika 78(1), 197–205.
Tang and Zhang (2022) Tang, Z. and K. Zhang (2022). Attainability and optimality: The equalized odds fairness revisited. In Conference on Causal Learning and Reasoning, pp. 754–786. PMLR.
Tansey et al. (2022) Tansey, W., V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei (2022). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics 31(1), 151–162.
Yang et al. (2022) Yang, J., A. A. S. Soltan, Y. Yang, and D. A. Clifton (2022). Algorithmic fairness and bias mitigation for clinical machine learning: Insights from rapid covid-19 diagnosis by adversarial learning. medRxiv.
Zafar et al. (2017) Zafar, M. B., I. Valera, M. Gomez Rodriguez, and K. P. Gummadi (2017). Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pp. 1171–1180.
Zafar et al. (2019) Zafar, M. B., I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi (2019). Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research 20(75), 1–42.
Zemel et al. (2013) Zemel, R., Y. Wu, K. Swersky, T. Pitassi, and C. Dwork (2013). Learning fair representations. In International conference on machine learning, pp. 325–333. PMLR.
Zhang et al. (2018) Zhang, B. H., B. Lemoine, and M. Mitchell (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340.

Appendix A Proofs

Proof of Theorem 2.1.

Let $S({\mathbf{A}})=\{A_{1},\ldots,A_{n}\}$ denote the row set of the observed $n$ realizations of sensitive attributes (unordered and duplicates are allowed). Let ${\mathbf{X}}$ , $\hat{\mathbf{Y}}\coloneqq f({\mathbf{X}})$ and ${\mathbf{Y}}$ be the associated $n$ feature, prediction, and response observations.

Task 1: Show that $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\overset{d}{=}(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ given conditional independence $\hat{Y}\perp\!\!\!\perp A\mid Y$ .

Proof of Task 1. Recall that conditional on $S({\mathbf{A}})=S$ for some $S=\{a_{1},...,a_{n}\}$ , we have (Berrett et al., 2020):

\mathbb{P}\left\{{\mathbf{A}}={\mathbf{a}}_{\pi}\mid S({\mathbf{A}})=S,{\mathbf{Y}}\right\}=\frac{q^{n}_{A|Y}\left({\mathbf{a}}_{\pi}\mid{\mathbf{Y}}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q_{A|Y}^{n}\left({\mathbf{a}}_{\pi^{\prime}}\mid{\mathbf{Y}}\right)},

(9)

where ${\mathbf{a}}=(a_{1},...,a_{n})$ is the stacked $a$ values in $S$ . On the other hand, conditional on $S(\tilde{\mathbf{A}})=S$ , by construction:

\displaystyle\mathbb{P}\left\{\tilde{\mathbf{A}}={\mathbf{a}}_{\pi}\mid S({\mathbf{A}})=S,{\mathbf{Y}}\right\}=\frac{q_{Y|A}^{n}\left({\mathbf{Y}}_{\pi^{-1}}\mid{\mathbf{a}}\right)}{\sum_{\pi^{\prime}}q^{n}_{Y|A}\left({\mathbf{Y}}_{{\pi^{\prime}}^{-1}}\mid{\mathbf{a}}\right)}=\frac{q_{A|Y}^{n}\left({\mathbf{a}}_{\pi}\mid{\mathbf{Y}}\right)}{\sum_{\pi^{\prime}}q_{A|Y}^{n}\left({\mathbf{a}}_{\pi^{\prime}}\mid{\mathbf{Y}}\right)},

(10)

where the last equality utilizes the following fact,

\displaystyle\frac{q_{Y|A}^{n}\left({\mathbf{Y}}_{\pi^{-1}}\mid{\mathbf{a}}% \right)}{\sum_{\pi^{\prime}}q^{n}_{Y|A}\left({\mathbf{Y}}_{{\pi^{\prime}}^{-1}% }|{\mathbf{a}}\right)}=\frac{q^{n}_{Y,A}\left({\mathbf{Y}}_{\pi^{-1}},{\mathbf% {a}}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q^{n}_{Y,A}\left({\mathbf{Y}% }_{{\pi^{\prime}}^{-1}},{\mathbf{a}}\right)}=\frac{q^{n}_{Y,A}\left({\mathbf{Y% }},{\mathbf{a}}_{\pi}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q^{n}_{Y,A}% \left({\mathbf{Y}},{\mathbf{a}}_{\pi^{\prime}}\right)}=\frac{q_{A|Y}^{n}\left(% {\mathbf{a}}_{\pi}\mid{\mathbf{Y}}\right)}{\sum_{\pi^{\prime}}q_{A|Y}^{n}\left% ({\mathbf{a}}_{\pi^{\prime}}\mid{\mathbf{Y}}\right)}.
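The identity above can be checked by brute force on a small discrete example. The sketch below is only an illustration under stated assumptions: the joint p.m.f. on a $3\times 4$ grid is randomly generated, $n=4$ is chosen so that all $4!$ permutations can be enumerated, and the convention ${\mathbf{a}}_{\pi}=(a_{\pi(1)},\ldots,a_{\pi(n)})$ is assumed.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf: joint[y, a] = q_{Y,A}(y, a) on a 3 x 4 grid
joint = rng.random((3, 4))
joint /= joint.sum()
q_y_given_a = joint / joint.sum(axis=0, keepdims=True)   # q(y | a)
q_a_given_y = joint / joint.sum(axis=1, keepdims=True)   # q(a | y)

n = 4
Y = rng.integers(0, 3, size=n)   # observed responses
a = rng.integers(0, 4, size=n)   # stacked values of the set S

perms = [np.array(p) for p in itertools.permutations(range(n))]

def normalize(weights):
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()

# ICP side: q^n_{Y|A}(Y_{pi^{-1}} | a); np.argsort(p) is the inverse permutation
icp_probs = normalize(
    [np.prod(q_y_given_a[Y[np.argsort(p)], a]) for p in perms]
)
# CP side: q^n_{A|Y}(a_pi | Y)
cp_probs = normalize([np.prod(q_a_given_y[Y, a[p]]) for p in perms])

max_gap = float(np.max(np.abs(icp_probs - cp_probs)))
```

The two normalized weight vectors agree because the marginal factors $\prod_{i}q_{A}(a_{\pi(i)})$ and $\prod_{i}q_{Y}(Y_{i})$ do not depend on $\pi$ and cancel in the ratio.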

Consequently, under the conditional independence assumption, we can write the joint distribution of $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})$ as the following ( $\hat{{\mathbf{y}}}$ , ${\mathbf{y}}$ are some stacked observation values $\hat{y}_{1},\ldots,\hat{y}_{n}$ and $y_{1},\ldots,y_{n}$ for $\hat{\mathbf{Y}}$ and ${\mathbf{Y}}$ respectively):

$\displaystyle\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},{\mathbf{A}}={% \mathbf{a}}_{\pi},{\mathbf{Y}}={\mathbf{y}})$	$\displaystyle=\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},{\mathbf{A}}={% \mathbf{a}}_{\pi}\mid{\mathbf{Y}}={\mathbf{y}})\cdot\mathbb{P}({\mathbf{Y}}={% \mathbf{y}})$
	$\displaystyle\overset{(b_{1})}{=}\mathbb{P}(\hat{\mathbf{Y}}=\hat{y}\mid{% \mathbf{Y}}={\mathbf{y}})\cdot\mathbb{P}({\mathbf{A}}={\mathbf{a}}_{\pi}\mid{% \mathbf{Y}}={\mathbf{y}})\cdot\mathbb{P}({\mathbf{Y}}={\mathbf{y}})$
	$\displaystyle=\mathbb{P}(\hat{\mathbf{Y}}=\hat{y}\mid{\mathbf{Y}}={\mathbf{y}}% )\cdot\mathbb{E}_{S}\left[\mathbb{P}({\mathbf{A}}={\mathbf{a}}_{\pi}\mid{% \mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)\right]\cdot\mathbb{P}({\mathbf{Y}}% =y)$
	$\displaystyle\overset{(b_{2})}{=}\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}}% \mid{\mathbf{Y}}={\mathbf{y}})\cdot\mathbb{E}_{S}\left[\mathbb{P}(\tilde{% \mathbf{A}}={\mathbf{a}}_{\pi}\mid{\mathbf{Y}}={\mathbf{y}},S(\tilde{\mathbf{A% }})=S)\right]\cdot\mathbb{P}({\mathbf{Y}}={\mathbf{y}})$
	$\displaystyle=\mathbb{E}_{S}\left[\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}}% \mid{\mathbf{Y}}={\mathbf{y}})\cdot\mathbb{P}(\tilde{\mathbf{A}}={\mathbf{a}}_% {\pi}\mid{\mathbf{Y}}={\mathbf{y}},S(\tilde{\mathbf{A}})=S)\right]\cdot\mathbb% {P}({\mathbf{Y}}={\mathbf{y}})$
	$\displaystyle\overset{(b_{3})}{=}\mathbb{E}_{S}\left[\mathbb{P}(\hat{\mathbf{Y% }}=\hat{\mathbf{y}},\tilde{\mathbf{A}}={\mathbf{a}}_{\pi}\mid{\mathbf{Y}}={% \mathbf{y}},S(\tilde{\mathbf{A}})=S)\right]\cdot\mathbb{P}({\mathbf{Y}}={% \mathbf{y}})$
	$\displaystyle=\mathbb{P}(\hat{\mathbf{Y}}=\hat{{\mathbf{y}}},\tilde{\mathbf{A}% }={\mathbf{a}}_{\pi}\mid{\mathbf{Y}}={\mathbf{y}})\cdot\mathbb{P}({\mathbf{Y}}% ={\mathbf{y}})$
	$\displaystyle=\mathbb{P}(\hat{\mathbf{Y}}=\hat{{\mathbf{y}}},\tilde{\mathbf{A}% }={\mathbf{a}}_{\pi},{\mathbf{Y}}={\mathbf{y}}).$	(11)

Here, step $(b_{2})$ has used eq. (9) and eq. (10), which establish the equivalence between the conditional laws of $A$ and $\tilde{A}$ ; steps $(b_{1})$ and $(b_{3})$ rely on the conditional independence relationships $A\perp\!\!\!\perp\hat{Y}\mid Y$ and $\tilde{A}\perp\!\!\!\perp\hat{Y}\mid Y$ . Hence, conditional independence implies the distributional equivalence $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\overset{d}{=}(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ .

Task 2: Show the further conditioned conditional independence $\hat{Y}\perp\!\!\!\perp A\mid Y,S({\mathbf{A}})$ given $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\overset{d}{=}(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ .

Proof of Task 2. When $\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},{\mathbf{A}}={\mathbf{a}},{\mathbf{Y}}={\mathbf{y}})=\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},\tilde{\mathbf{A}}={\mathbf{a}},{\mathbf{Y}}={\mathbf{y}})$ , we have

$\displaystyle\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},{\mathbf{A}}={% \mathbf{a}}\mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)$	$\displaystyle=\frac{\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},{\mathbf{A}}=% {\mathbf{a}},{\mathbf{Y}}={\mathbf{y}})}{\mathbb{P}({\mathbf{Y}}={\mathbf{y}},% S({\mathbf{A}})=S)}$
	$\displaystyle=\frac{\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},\tilde{% \mathbf{A}}={\mathbf{a}},{\mathbf{Y}}={\mathbf{y}})}{\mathbb{P}({\mathbf{Y}}={% \mathbf{y}},S({\mathbf{A}})=S)}$
	$\displaystyle=\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}},\tilde{\mathbf{A}}=% {\mathbf{a}}\mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)$
	$\displaystyle\overset{(b_{1})}{=}\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}}% \mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)\mathbb{P}(\tilde{\mathbf{A}}=% {\mathbf{a}}\mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)$
	$\displaystyle\overset{(b_{2})}{=}\mathbb{P}(\hat{\mathbf{Y}}=\hat{\mathbf{y}}\mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S)\mathbb{P}({\mathbf{A}}={\mathbf{a}}\mid{\mathbf{Y}}={\mathbf{y}},S({\mathbf{A}})=S).$	(12)

Here, step $(b_{1})$ holds by the construction of $\tilde{\mathbf{A}}$ , while step $(b_{2})$ holds as a result of eq. (9) and eq. (10).

Task 3: Show the asymptotic equalized odds given $(\hat{\mathbf{Y}},{\mathbf{A}},{\mathbf{Y}})\overset{d}{=}(\hat{\mathbf{Y}},\tilde{\mathbf{A}},{\mathbf{Y}})$ .

Proof of Task 3. We prove this statement using the previous statement and a known multi-dimensional c.d.f (cumulative distribution function) estimation bound (Naaman, 2021). Let $t^{1}$ and $t^{2}$ be constant vectors of the same dimensions as $\hat{Y}$ and $A$ , and $t^{3}$ be a constant vector of the same dimension as $Y$ . Construct augmented matrices ${\mathbf{t}}^{1}$ , ${\mathbf{t}}^{2}$ , ${\mathbf{t}}^{3}$ where ${\mathbf{t}}_{1.}^{1}=t^{1}$ , ${\mathbf{t}}_{1.}^{2}=t^{2}$ , ${\mathbf{t}}_{i.}^{1}=\infty$ and ${\mathbf{t}}_{i.}^{2}=\infty$ for $i=2,\ldots,n$ , and ${\mathbf{t}}^{3}_{i.}=t^{3}$ for all $i=1,\ldots,n$ . Let $(\hat{Y}_{1},A_{1},Y_{1})$ be a draw from the same distribution as $(\hat{Y},A,Y)$ . Then,

	$\displaystyle\mathbb{P}\left(\hat{Y}_{1}\leq t^{1},A_{1}\leq t^{2}\mid Y_{1}=t^{3}\right)$	$\displaystyle\overset{(b_{1})}{=}\mathbb{P}\left(\hat{Y}_{1}\leq t^{1},A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3}\right)$
		$\displaystyle=\frac{\mathbb{P}\left(\hat{Y}_{1}\leq t^{1},A_{1}\leq t^{2},{\mathbf{Y}}={\mathbf{t}}^{3}\right)}{\mathbb{P}({\mathbf{Y}}={\mathbf{t}}^{3})}$
		$\displaystyle\overset{(b_{2})}{=}\frac{\mathbb{P}\left(\hat{\mathbf{Y}}\leq{\mathbf{t}}^{1},{\mathbf{A}}\leq{\mathbf{t}}^{2},{\mathbf{Y}}={\mathbf{t}}^{3}\right)}{\mathbb{P}({\mathbf{Y}}={\mathbf{t}}^{3})},$

where step $(b_{1})$ has used the fact that $(X_{i},A_{i},Y_{i})$ , for $i=1,\ldots,n$ , are independently generated, so conditioning on the additional independent $Y_{2},\ldots,Y_{n}$ does not change the probability; step $(b_{2})$ holds because ${\mathbf{t}}^{1}_{i.}$ and ${\mathbf{t}}^{2}_{i.}$ , for $i=2,\ldots,n$ , take infinite values and do not modify the event considered. Utilizing eq. (12), we further have

		$\displaystyle\frac{\mathbb{P}\left(\hat{\mathbf{Y}}\leq{\mathbf{t}}^{1},{\mathbf{A}}\leq{\mathbf{t}}^{2},{\mathbf{Y}}={\mathbf{t}}^{3}\right)}{\mathbb{P}({\mathbf{Y}}={\mathbf{t}}^{3})}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\mathbb{P}\left(\hat{\mathbf{Y}}\leq{\mathbf{t}}^{1},{\mathbf{A}}\leq{\mathbf{t}}^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\mathbb{P}\left(\hat{\mathbf{Y}}\leq{\mathbf{t}}^{1}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S(\tilde{\mathbf{A}})=S\right)\mathbb{P}\left(\tilde{\mathbf{A}}\leq{\mathbf{t}}^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S(\tilde{\mathbf{A}})=S\right)\right]$
	$\displaystyle\overset{(b_{3})}{=}$	$\displaystyle\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\mathbb{P}\left(\hat{Y}_{1}\leq t^{1}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S(\tilde{\mathbf{A}})=S\right)\mathbb{P}\left(\tilde{A}_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S(\tilde{\mathbf{A}})=S\right)\right]$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\mathbb{P}\left(\hat{Y}_{1}\leq t^{1}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)\mathbb{P}\left(A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)\right]$
	$\displaystyle=$	$\displaystyle\mathbb{P}\left(\hat{Y}_{1}\leq t^{1}\mid Y_{1}=t^{3}\right)\mathbb{P}\left(A_{1}\leq t^{2}\mid Y_{1}=t^{3}\right)+\Delta,$

where step $(b_{3})$ has used again the fact that ${\mathbf{t}}^{1}_{i.}=\infty$ and ${\mathbf{t}}^{2}_{i.}=\infty$ , for $i=2,\ldots,n$ , and $\Delta$ is defined as

	$\displaystyle\Delta$	$\displaystyle=\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\mathbb{P}\left(\hat{Y}_{1}\leq t^{1}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)\left(\mathbb{P}\left(A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)-\mathbb{P}\left(A_{1}\leq t^{2}\mid Y_{1}=t^{3}\right)\right)\right].$

Our goal is equivalent to bounding $|\Delta|$ . Notice that since ${\mathbf{t}}_{1.}^{3}=\ldots={\mathbf{t}}_{n.}^{3}=t^{3}$ are the same for all $n$ samples, $A_{1}$ , $\ldots$ , $A_{n}$ are exchangeable given $S({\mathbf{A}})=S$ . Consequently, we obtain that

	$\displaystyle|\Delta|$	$\displaystyle\leq\mathbb{E}_{S\mid{\mathbf{Y}}}\left[\left|\mathbb{P}\left(A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)-\mathbb{P}\left(A_{1}\leq t^{2}\mid Y_{1}=t^{3}\right)\right|\right]$
		$\displaystyle=\sum_{S}\left|\mathbb{P}\left(A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)\mathbb{P}\left(S({\mathbf{A}})=S\mid{\mathbf{Y}}={\mathbf{t}}^{3}\right)-\mathbb{P}\left(A_{1}\leq t^{2}\mid Y_{1}=t^{3}\right)\mathbb{P}\left(S({\mathbf{A}})=S\mid{\mathbf{Y}}={\mathbf{t}}^{3}\right)\right|$
		$\displaystyle\overset{(b_{4})}{=}\sum_{S}\left|\hat{F}_{t^{3}}^{S}(t^{2})-F_{t^{3}}(t^{2})\right|\mathbb{P}\left(S({\mathbf{A}})=S\mid{\mathbf{Y}}={\mathbf{t}}^{3}\right),$

where step $(b_{4})$ has used the exchangeability of $A_{1},\ldots,A_{n}$ , which makes $\mathbb{P}\left(A_{1}\leq t^{2}\mid{\mathbf{Y}}={\mathbf{t}}^{3},S({\mathbf{A}})=S\right)$ equal to the $S$ -induced empirical c.d.f evaluated at $t^{2}$ . Here, $S$ is a set of $n$ samples of $A$ generated conditional on $Y=t^{3}$ , $\hat{F}_{t^{3}}^{S}(\cdot)$ denotes the empirical c.d.f induced by $S$ , and $F_{t^{3}}(\cdot)$ denotes the theoretical c.d.f of $A\mid Y=t^{3}$ . From Lemma 4.1 in Naaman (2021), which generalizes the Dvoretzky–Kiefer–Wolfowitz inequality to multi-dimensional empirical c.d.f, we know

\mathbb{P}\left(\sup_{t^{2}}\left|\hat{F}_{t^{3}}^{S}(t^{2})-F_{t^{3}}(t^{2})\right|>\delta\right)\leq p(n+1)\exp(-2n\delta^{2}).

Combining this inequality with the bound for $|\Delta|$ , we have

\mathbb{P}(|\Delta|>C\frac{\log p+\log n}{n})\rightarrow 0,

for a sufficiently large $C$ as $n\rightarrow\infty$ . We thus reach our conclusion that

\lim_{n\rightarrow\infty}\left[\mathbb{P}\left(\hat{Y}\leq t^{1},A\leq t^{2}\mid Y=t^{3}\right)-\mathbb{P}\left(\hat{Y}\leq t^{1}\mid Y=t^{3}\right)\mathbb{P}\left(A\leq t^{2}\mid Y=t^{3}\right)\right]=0.

∎

Proof of Theorem 2.3.

For fixed $f$ , the optimal discriminator $D^{*}$ is reached at

\hat{\theta}^{*}_{d}=\arg\min_{\theta_{d}}\mathcal{L}_{d}\left(\theta_{f},% \theta_{d}\right),

in which case, the discriminating classifier is $D_{\theta^{*}_{d}}(\cdot)=\dfrac{p_{\hat{Y}AY}(\cdot)}{p_{\hat{Y}AY}(\cdot)+p_% {\hat{Y}\tilde{A}Y}(\cdot)}$ (See Proposition 1 in (Goodfellow et al., 2014)), and $\mathcal{L}_{d}$ reduces to

\displaystyle\mathcal{L}_{d}\left(\theta_{f},\theta_{d}\right)=\log(4)-2\cdot JSD\left(p_{\hat{Y}AY}\|~{}p_{\hat{Y}\tilde{A}Y}\right),

where $JSD$ is the Jensen-Shannon divergence between the distributions of $(\hat{Y},A,Y)$ and $(\hat{Y},\tilde{A},Y)$ . Plugging this into $V_{\mu}(\theta_{f},\theta_{d})$ , we reach the single-parameter form of the original objective:

\displaystyle V_{\mu}\left(\theta_{f}\right)=\min_{\theta_{d}}V_{\mu}(\theta_{f},\theta_{d})=(1-\mu)\mathcal{L}_{f}\left(\theta_{f}\right)+2\mu\cdot JSD\left(p_{\hat{Y}AY}\|~{}p_{\hat{Y}\tilde{A}Y}\right)-\mu\log(4)\geq(1-\mu)H(Y\mid X)-\mu\log(4),

where the equality holds at $\theta^{*}=\arg\min_{\theta_{f}}V_{\mu}\left(\theta_{f}\right)$. In summary, the solution value $(1-\mu)H(Y\mid X)-\mu\log(4)$ is achieved when:

•

$\hat{\theta}_{f}$ minimizes the negative $\log$-likelihood of $Y\mid X$ under $f$, which happens when $\hat{\theta}_{f}$ are the parameters of an optimal predictor $f$. In this case, $\mathcal{L}_{f}$ reduces to its minimum value $H(Y\mid X)$.
•

$\hat{\theta}_{f}$ minimizes the Jensen–Shannon divergence $JSD\left(p_{\hat{Y}AY}\|p_{\hat{Y}\tilde{A}Y}\right)$, which happens when the divergence is zero, since the Jensen–Shannon divergence between two distributions is always non-negative and is zero if and only if they are equal.

The second characterization is equivalent to the condition $(\hat{Y},A,Y)\,{\buildrel d\over{=}}\,(\hat{Y},\tilde{A},Y)$. Note that this is a population-level characterization, with $\mathbb{E}$ corresponding to the case where $n\rightarrow\infty$. As a result, by the asymptotic equalized odds statement in Theorem 2.1, we have that $\hat{f}_{\hat{\theta}_{f}}$ also satisfies equalized odds. ∎
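The reduction of $\mathcal{L}_{d}$ at the optimal discriminator can be checked numerically on a small discrete example. The sketch below assumes that $\mathcal{L}_{d}$ is the population cross-entropy $-\mathbb{E}_{p}[\log D]-\mathbb{E}_{q}[\log(1-D)]$, with $p$ and $q$ standing in for $p_{\hat{Y}AY}$ and $p_{\hat{Y}\tilde{A}Y}$; the two probability vectors are arbitrary illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = 0.5 * (p + q)
    kl_pm = float(np.sum(p * np.log(p / m)))
    kl_qm = float(np.sum(q * np.log(q / m)))
    return 0.5 * kl_pm + 0.5 * kl_qm

def discriminator_loss(p, q, d):
    """Cross-entropy loss -E_p[log D] - E_q[log(1 - D)] (assumed form of L_d)."""
    return float(-np.sum(p * np.log(d)) - np.sum(q * np.log(1.0 - d)))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))   # stands in for the law of (Y_hat, A, Y)
q = rng.dirichlet(np.ones(6))   # stands in for the law of (Y_hat, A_tilde, Y)

d_star = p / (p + q)            # optimal discriminator for fixed f
loss_star = discriminator_loss(p, q, d_star)

print(loss_star, np.log(4) - 2 * jsd(p, q))   # the two values should agree
print(discriminator_loss(p, p, np.full(6, 0.5)), np.log(4))  # equal laws give log 4
```

When the two distributions coincide, the divergence vanishes and the discriminator loss equals $\log(4)$, which is the case characterized by the second bullet above.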

Proof of Proposition 3.2.

The proposed test is a special case of the Conditional Permutation Test (Berrett et al., 2020), so the result follows directly from Theorem 2.1 in our paper and Theorem 1 in Berrett et al. (2020).

∎

Appendix B Sampling Algorithm

To sample the permutation $\Pi$ from the probabilities:

\mathbb{P}\left\{\Pi=\pi\mid{\mathbf{A}},{\mathbf{Y}}\right\}=\frac{q^{n}\left({\mathbf{Y}}_{{\pi}^{-1}}\mid{\mathbf{A}}\right)}{\sum_{\pi^{\prime}\in\mathcal{S}_{n}}q^{n}\left({\mathbf{Y}}_{{\pi^{\prime}}^{-1}}\mid{\mathbf{A}}\right)},

we use the parallelized pairwise sampler for the CPT proposed by Berrett et al. (2020), which is detailed as follows:

Input: Data $({\mathbf{A}},{\mathbf{Y}})$, initial permutation $\Pi^{[0]}$, integer $S\geq 1$.

1: for $s=1,\dots,S$ do

2: Sample uniformly without replacement from $\{1,\ldots,n\}$ to obtain disjoint pairs $\left(i_{s,1},j_{s,1}\right),\ldots,\left(i_{s,\lfloor n/2\rfloor},j_{s,\lfloor n/2\rfloor}\right)$.

3: Draw independent Bernoulli variables $B_{s,1},\ldots,B_{s,\lfloor n/2\rfloor}$ with odds ratios

\frac{\mathbb{P}\left\{B_{s,k}=1\right\}}{\mathbb{P}\left\{B_{s,k}=0\right\}}=\frac{q\left(Y_{\left(\Pi^{[s-1]}\left(j_{s,k}\right)\right)}\mid A_{i_{s,k}}\right)\cdot q\left(Y_{\left(\Pi^{[s-1]}\left(i_{s,k}\right)\right)}\mid A_{j_{s,k}}\right)}{q\left(Y_{\left(\Pi^{[s-1]}\left(i_{s,k}\right)\right)}\mid A_{i_{s,k}}\right)\cdot q\left(Y_{\left(\Pi^{[s-1]}\left(j_{s,k}\right)\right)}\mid A_{j_{s,k}}\right)}.

Define $\Pi^{[s]}$ by swapping $\Pi^{[s-1]}\left(i_{s,k}\right)$ and $\Pi^{[s-1]}\left(j_{s,k}\right)$ for each $k$ with $B_{s,k}=1$.

4: end for

Output: Permuted copy $\tilde{\mathbf{A}}={\mathbf{A}}_{{\Pi^{[S]}}^{-1}}$.

Algorithm 3 Parallelized pairwise sampler for the ICP
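The steps of Algorithm 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than our released implementation: the names `pairwise_sampler` and `gaussian_log_q` are hypothetical, and a Gaussian linear model is assumed for $q(y\mid a)$ purely as a stand-in for whatever conditional density estimate is available. The sampler only needs the matrix of log-densities $\log q(Y_{j}\mid A_{i})$.

```python
import numpy as np

def gaussian_log_q(A, Y, beta, sigma):
    """Hypothetical stand-in for log q(Y_j | A_i): Gaussian linear model.
    Entry [i, j] is log q(Y_j | A_i) up to an additive constant."""
    mean = A @ beta
    return -0.5 * ((Y[None, :] - mean[:, None]) / sigma) ** 2

def pairwise_sampler(log_q, n_steps, rng, perm=None):
    """Parallelized pairwise sampler; perm[i] plays the role of Pi(i)."""
    n = log_q.shape[0]
    perm = np.arange(n) if perm is None else perm.copy()
    half = n // 2
    for _ in range(n_steps):
        # Step 2: floor(n/2) disjoint pairs drawn without replacement.
        shuffled = rng.permutation(n)
        i, j = shuffled[:half], shuffled[half:2 * half]
        # Step 3: log odds of swapping Pi(i) and Pi(j) for each pair.
        log_odds = (log_q[i, perm[j]] + log_q[j, perm[i]]
                    - log_q[i, perm[i]] - log_q[j, perm[j]])
        prob_swap = np.exp(-np.logaddexp(0.0, -log_odds))  # stable sigmoid
        swap = rng.random(half) < prob_swap
        i_sw, j_sw = i[swap], j[swap]
        perm[i_sw], perm[j_sw] = perm[j_sw], perm[i_sw]
    return perm

rng = np.random.default_rng(0)
n, beta = 100, np.array([1.0, -0.5])
A = rng.standard_normal((n, 2))
Y = A @ beta + rng.standard_normal(n)

perm = pairwise_sampler(gaussian_log_q(A, Y, beta, 1.0), n_steps=50, rng=rng)
A_tilde = A[np.argsort(perm)]   # permuted copy A_{Pi^{-1}}
```

Working on the log scale avoids overflow in the odds ratio, and the pairs within one step are disjoint, so all swaps in a step can be applied at once.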

Appendix C Additional comparisons of CP/ICP

When we know the true conditional laws $q_{Y|A}(.)$ (the conditional density of $Y$ given $A$) and $q_{A|Y}(.)$ (the conditional density of $A$ given $Y$), both CP and ICP provide accurate conditional permutation copies. However, both densities are estimated in practice, and the estimated densities are denoted as $\check{q}_{Y|A}(.)$ and $\check{q}_{A|Y}(.)$, respectively. The quality of the density estimation depends on both the density estimation algorithm and the data distribution. While a deep dive into this aspect, especially from a theoretical perspective, is beyond the scope of this work, we provide some additional heuristic insights into the potential gain of ICP over CP.

When might ICP improve over CP?

Following the proof of Theorem 4 in Berrett et al. (2020), let ${\mathbf{A}}_{\pi}$ be a permuted copy of ${\mathbf{A}}$ drawn according to the estimated conditional law $\check{q}_{A|Y}(.)$, and let $\check{{\mathbf{A}}}$ be a new copy sampled according to $\check{q}_{A|Y}(.)$. An upper bound on the exchangeability violation for ${\mathbf{A}}$ and ${\mathbf{A}}_{\pi}$ is related to the total variation distance between the estimated density $\check{q}_{A|Y}(.)$ and the true density $q_{A|Y}(.)$:

d_{TV}\{((({\mathbf{Y}},{\mathbf{A}}),({\mathbf{Y}},{\mathbf{A}}_{\pi}))|{\mathbf{Y}}),((({\mathbf{Y}},\check{{\mathbf{A}}}),({\mathbf{Y}},{\mathbf{A}}_{\pi}))|{\mathbf{Y}})\}\leq d_{TV}(\prod_{i=1}^{n}\check{q}_{A|Y}(.|y_{i}),\prod_{i=1}^{n}q_{A|Y}(.|y_{i}))\overset{(b_{1})}{\leq}\sum_{i=1}^{n}d_{TV}(\check{q}_{A|Y}(.|y_{i}),q_{A|Y}(.|y_{i})),

(13)

where step $(b_{1})$ follows from Lemma B.8 in Ghosal and van der Vaart (2017). We adapt the proof arguments of Theorem 4 in Berrett et al. (2020) to the ICP procedure.

Specifically, let ${\mathbf{Y}}_{\pi}$ be the conditional permutation of ${\mathbf{Y}}$ according to $\check{q}_{Y|A}(.)$ and $\check{{\mathbf{Y}}}$ be a new copy sampled according to $\check{q}_{Y|A}(.)$ . We will have

d_{TV}\{((({\mathbf{Y}},{\mathbf{A}}),({\mathbf{Y}}_{\pi},{\mathbf{A}}))|{\mathbf{A}}),(((\check{{\mathbf{Y}}},{\mathbf{A}}),({\mathbf{Y}}_{\pi},{\mathbf{A}}))|{\mathbf{A}})\}\leq\sum_{i=1}^{n}d_{TV}(\check{q}_{Y|A}(.|A_{i}),q_{Y|A}(.|A_{i})).

(14)

One issue must be resolved before we can compare the CP and ICP upper bounds for exchangeability violations: the two bounds involve different variables and conditioning events. Since we care only about distribution-level comparisons, we can apply the permutation $\pi^{-1}$ to $({\mathbf{Y}},{\mathbf{A}})$ and $({\mathbf{Y}}_{\pi},{\mathbf{A}})$. The resulting $({\mathbf{Y}}_{\pi^{-1}},{\mathbf{A}}_{\pi^{-1}})$ is equivalent to $({\mathbf{Y}},{\mathbf{A}})$, and the resulting $({\mathbf{Y}},{\mathbf{A}}_{\pi^{-1}})$ is exactly the ICP conditionally permuted version. Next, we can remove the conditioning events by marginalizing out ${\mathbf{Y}}$ and ${\mathbf{A}}$ in (13) and (14), respectively. Hence, we obtain upper bounds for the violation of exchangeability using CP and ICP permutation copies, and the bound is smaller for ICP if $\check{q}_{Y|A}(.)$ is more accurate on average:

\mathbb{E}_{A}\left[d_{TV}(\check{q}_{Y|A}(.|A),q_{Y|A}(.|A))\right]<\mathbb{E}_{Y}\left[d_{TV}(\check{q}_{A|Y}(.|Y),q_{A|Y}(.|Y))\right].

ICP achieved higher quality empirically

To illustrate that ICP can provide a resampling distribution closer to that of the oracle conditional permutation than CP does, with both utilizing off-the-shelf tools for density estimation in varying dimensions, we consider the following example:

(1) Let $A=(U_{1},\ldots,U_{K_{0}},U_{K_{0}+1},\ldots,U_{K_{0}+K})\times\Theta^{\frac{1}{2}}$. Here the $U_{j}$ are independently generated from either the standard normal $\mathcal{N}(0,1)$ or a mixed Gamma distribution $\frac{1}{2}\Gamma(1,1)+\frac{1}{2}\Gamma(1,10)$; $\Theta$ is a randomly generated covariance matrix with eigenvalues equally spaced on $[1,5]$.

(2) Let $Y\sim\mathcal{N}(\sqrt{.5}\sum_{k=1}^{K_{0}}A_{k},1.5)$. That is, $Y$ is influenced only by the first $K_{0}$ columns of $A$, with the next $K$ columns of $A$ being noise.

(3) We estimate $q_{Y|A}$ / $q_{A|Y}$ using (a) lasso regression/graphical lasso, where we estimate the linear dependence of $Y$ on $A$ and the variance empirically for $q_{Y|A}$, and estimate $q_{A|Y}$ assuming joint normality of $(Y,A)$ (for $K_{0}=1$ and $K=0$, OLS was used for both estimations); and (b) MAF, which is the default in our paper.

We set $K_{0}\in\{1,5\}$, $K\in\{0,5,10,20\}$, and use a sample size of 200 both for density estimation and for evaluating the conditional permutation distribution. We are interested in the total variation distance between the permutation distributions obtained by ICP and CP with the estimated densities and the one obtained with the true density, which is explicitly known in this example up to a normalization constant.
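The data-generating process in (1) and (2) can be sketched as follows. Several details are assumptions made only for illustration and are not fixed by the description above: $\Gamma(1,s)$ is read as shape 1 and scale $s$, the value 1.5 is read as the variance of $Y$ given $A$, the random covariance $\Theta$ is built from a random orthogonal matrix, and a given run uses one of the two distributions for all $U_{j}$. The function names are hypothetical.

```python
import numpy as np

def random_covariance(dim, rng, low=1.0, high=5.0):
    """Covariance with eigenvalues equally spaced on [low, high] and a
    random orthogonal eigenbasis (the eigenbasis choice is an assumption)."""
    eigvals = np.linspace(low, high, dim)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q @ np.diag(eigvals) @ q.T

def matrix_sqrt(theta):
    """Symmetric square root of a positive-definite matrix."""
    w, v = np.linalg.eigh(theta)
    return v @ np.diag(np.sqrt(w)) @ v.T

def simulate(n, k0, k, rng, gamma_mix=False):
    dim = k0 + k
    if gamma_mix:
        # ASSUMPTION: Gamma(1, s) means shape 1 and scale s; mix the two equally.
        scale = rng.choice([1.0, 10.0], size=(n, dim))
        u = rng.gamma(shape=1.0, scale=scale)
    else:
        u = rng.standard_normal((n, dim))
    theta = random_covariance(dim, rng)
    a = u @ matrix_sqrt(theta)
    # ASSUMPTION: 1.5 is the conditional variance of Y given A.
    noise = rng.normal(0.0, np.sqrt(1.5), size=n)
    y = np.sqrt(0.5) * a[:, :k0].sum(axis=1) + noise
    return a, y, theta

rng = np.random.default_rng(0)
for k0 in (1, 5):
    for k in (0, 5, 10, 20):
        a, y, theta = simulate(200, k0, k, rng)
        print(k0, k, a.shape, y.shape)
```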

Due to the large permutation space, calculating the actual total variation distance is difficult. To circumvent this challenge, we restrict the permutation space to swapping actions: we consider the TV distance ($\log_{10}$-transformed) restricted to the original order and the permutations $\pi$ that swap $i$ and $j$ for $i\neq j$, $i,j=1,\ldots,n$, and compare ICP and CP to the oracle conditional permutations on these $\frac{n(n-1)}{2}+1$ permutations only.
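One way to compute such a swap-restricted TV distance is sketched below. It is an illustration under stated assumptions, not the exact evaluation code behind our figures: the permutation weights are taken proportional to $\prod_{i}q(Y_{\pi^{-1}(i)}\mid A_{i})$, renormalized over the identity and all pairwise swaps, and the inputs are matrices whose $(i,j)$ entry is $\log q(Y_{j}\mid A_{i})$ under the estimated and the true density. The function names and the random matrices in the usage lines are hypothetical.

```python
import numpy as np
from itertools import combinations

def swap_restricted_probs(log_q):
    """Probabilities over {identity} plus all pairwise swaps.
    log_q[i, j] = log q(Y_j | A_i); additive constants cancel."""
    n = log_q.shape[0]
    diag = np.diag(log_q)
    log_w = [0.0]  # identity: log-weight 0 relative to itself
    for i, j in combinations(range(n), 2):
        # Swapping i and j replaces q(Y_i|A_i) q(Y_j|A_j) by q(Y_j|A_i) q(Y_i|A_j).
        log_w.append(log_q[i, j] + log_q[j, i] - diag[i] - diag[j])
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    return w / w.sum()

def restricted_tv(log_q_est, log_q_true):
    """TV distance between the two swap-restricted permutation distributions."""
    p_est = swap_restricted_probs(log_q_est)
    p_true = swap_restricted_probs(log_q_true)
    return 0.5 * np.abs(p_est - p_true).sum()

rng = np.random.default_rng(0)
log_q_true = rng.standard_normal((20, 20))                     # placeholder values
log_q_est = log_q_true + 0.1 * rng.standard_normal((20, 20))   # placeholder estimate
print(np.log10(restricted_tv(log_q_est, log_q_true)))
```

The same routine applies to CP by passing the corresponding log-density matrices built from $q_{A|Y}$ and $\check{q}_{A|Y}$.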

Figure 6 and Figure 7 show the results using MAF and cross-validated lasso regression or graphical lasso, respectively (repeated 20 times for each setting). We see that the TV distances between ICP and the oracle are smaller than the corresponding ones for CP under both density estimation approaches. MAF is a default density estimation approach for general purposes. By design, lasso regression/OLS is favored over MAF for estimating $q_{Y|A}$ in this particular example. There may be better density estimation choices in other applications, but overall, estimating $Y|A$ can be simpler and allows us to utilize existing tools, e.g., those designed for supervised learning.

Appendix D Experiments

In both simulation studies and real-data experiments, we implement the algorithms with the hyperparameters chosen by the tuning procedure of Romano et al. (2020). In practice, we tune the hyperparameters only once, using 10-fold cross-validation on the entire data set, and then treat the chosen set as fixed for the rest of the experiments. We then compare the performance metrics of the different algorithms on 100 data splits that differ from the ones used to tune the parameters. The same tuning scheme is used for all methods, ensuring that the comparisons are meaningful.

D.1 Experiments on synthetic datasets

For all the models evaluated (FairICP, Oracle, FDL), we set the hyperparameters as follows:

•

We set $f$ as a linear model and use the Adam optimizer with a mini-batch size in {16, 32, 64}, learning rate in {1e-4, 1e-3, 1e-2}, and the number of epochs in {20, 40, 60, 80, 100, 120, 140, 160, 180, 200}. The discriminator is implemented as a four-layer neural network with a hidden layer of size 64 and ReLU non-linearities. We use the Adam optimizer, with a fixed learning rate of 1e-4.
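For concreteness, the search space above can be enumerated as in the following sketch. The dictionary keys and the function name are hypothetical, and the 10-fold cross-validation loop that scores each candidate is deliberately left out.

```python
from itertools import product

# Search space for the predictor f, as listed above.
BATCH_SIZES = (16, 32, 64)
LEARNING_RATES = (1e-4, 1e-3, 1e-2)
EPOCHS = tuple(range(20, 201, 20))  # 20, 40, ..., 200

# Fixed discriminator settings, as listed above (key names are hypothetical).
DISCRIMINATOR = {"hidden_size": 64, "activation": "relu",
                 "optimizer": "adam", "lr": 1e-4}

def candidate_configs():
    """Yield one configuration per point of the hyperparameter grid."""
    for batch_size, lr, n_epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCHS):
        yield {"batch_size": batch_size, "lr": lr, "n_epochs": n_epochs,
               "optimizer": "adam", "discriminator": DISCRIMINATOR}

configs = list(candidate_configs())
print(len(configs))  # number of candidate settings in the grid
```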

D.1.1 Low sensitive attribute dependence for Sim1

We report the results with A-dependence $w=0.6$ here:

D.1.2 Low sensitive attribute dependence for Sim2