Mind the Graph When Balancing Data for Fairness or Robustness

Jessica Schrouff
Google DeepMind
[email protected]
Alexis Bellot
Google DeepMind
Amal Rannen-Triki
Google DeepMind
Alan Malek
Google DeepMind
Isabela Albuquerque
Google DeepMind
Arthur Gretton
Google DeepMind
Gatsby Computational Neuroscience Unit
Alexander D’Amour
Google DeepMind
Silvia Chiappa
Google DeepMind

Abstract

Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.

1 Introduction

When training prediction models, practitioners often desire that the model’s outputs display safety properties in addition to high performance, such as being fair across demographic subgroups [29, 50] or being robust to distribution shifts [e.g. 19, 58]. These objectives can be difficult to attain if there are undesired dependencies between covariates $X$ , labels $Y$ , and auxiliary factors of variation $Z$ , such as confounding factors or hidden stratification [26, 27]. A commonly referenced example is that of an animal classification task from wildlife pictures [e.g. 63]: the model might identify patterns in the background of the images that are indicative of the type of animal (e.g. the presence of snow for polar bears or grass for cows), which might lead to the model failing to recognize the same animal when it is on another background. When the auxiliary factors relate to demographic attributes, the deployment of such models can have societal implications, e.g. patients not being assigned medical resources due to factors related to race [53].

Multiple mitigation strategies have been proposed to remove undesired dependencies pre-, in- or post-processing. Amongst them, balancing the training data is typically considered a straightforward approach and has been used or researched in various settings [e.g. 37, 38, 59, 8, 33, 39, 2]. This approach modifies the training distribution, indicated with $P(X,Y,Z)$ , into a new, balanced distribution (which we refer to as $Q(X,Y,Z)$ ) that aims to approximate an ‘idealized’ training distribution in which the undesired dependencies are absent [47, 14, 76]. Models are then trained on this balanced distribution to attain different fairness or robustness criteria. A popular approach to construct a balanced distribution is by balancing classes (resp. groups), leading to a uniform distribution over $Y$ (resp. $Z$ ). While successful for addressing failures of robustness [e.g. 33] or of fairness due to under-representation of certain groups [e.g. 74], this approach does not induce independence between $Y$ and $Z$ . To approximate independence, a ‘joint’ balancing on $(Y,Z)$ is often performed [e.g. 47, 8]. Joint balancing can be implemented by matching the numbers of samples in all $(y,z)$ groups (only feasible when $Y$ and $Z$ have small, discrete domains) via subsampling the majority groups [e.g. 8], upsampling the minority groups [e.g. 62], resampling the data with weights proportional to $P(Y)P(Z)/P(Y,Z)$ , or reweighting the loss [9]. Our work focuses on joint balancing given its suitability to mitigate a marginal dependence between $Y$ and $Z$ (footnote 1: we briefly discuss group or class data balancing in Appendix A.1). While the choice of the method for jointly balancing can impact the results [11, 64, 33], these methods can be seen as modifying $P$ as described in Definition 1.1.
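As an illustration of the subsampling variant mentioned above, the following minimal sketch keeps the same number of samples in every observed $(y,z)$ group. It is not the implementation used in our experiments: the function name `subsample_to_balance`, the toy data and the 90% dependence level are assumptions made for this example, and $Y$ and $Z$ are assumed to be small and discrete.

```python
import numpy as np

def subsample_to_balance(y, z, rng):
    """Return sorted indices keeping the same number of samples in every observed (y, z) cell."""
    cells = {}
    for i, key in enumerate(zip(np.asarray(y).tolist(), np.asarray(z).tolist())):
        cells.setdefault(key, []).append(i)
    # The smallest cell dictates how many samples every cell may keep.
    n_min = min(len(idx) for idx in cells.values())
    keep = [rng.choice(idx, size=n_min, replace=False) for idx in cells.values()]
    return np.sort(np.concatenate(keep))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
# Hypothetical dependence: Z copies Y 90% of the time.
z = np.where(rng.random(2000) < 0.9, y, 1 - y)

idx = subsample_to_balance(y, z, rng)
cell_counts = {
    (a, b): int(np.sum((y[idx] == a) & (z[idx] == b)))
    for a in (0, 1) for b in (0, 1)
}
print(cell_counts)
```

After subsampling, all $(y,z)$ cells have equal counts, so $Y$ and $Z$ are empirically independent in the retained subset, at the cost of discarding samples from the majority groups.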

Definition 1.1 (Jointly balanced distribution).

We say that the distribution $Q(X,Y,Z)$ is a jointly balanced version of $P(X,Y,Z)$ if $Q(X,Y,Z)=P(X,Y,Z)\frac{P(Y)P(Z)}{P(Y,Z)}$ .
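To make Definition 1.1 concrete, the sketch below estimates the weights $P(Y)P(Z)/P(Y,Z)$ from empirical frequencies and checks that the reweighted joint over $(Y,Z)$ factorizes. The helper names `joint_balancing_weights` and `weighted_joint` and the toy data are hypothetical, and the labels and factors are assumed to be integer-coded with every $(y,z)$ cell observed.

```python
import numpy as np

def joint_balancing_weights(y, z):
    """Empirical weights P(y)P(z)/P(y,z) for integer-coded labels y and factors z."""
    y = np.asarray(y)
    z = np.asarray(z)
    joint = np.zeros((y.max() + 1, z.max() + 1))
    np.add.at(joint, (y, z), 1.0)          # counts per (y, z) cell
    joint /= joint.sum()                   # empirical P(y, z)
    p_y = joint.sum(axis=1)
    p_z = joint.sum(axis=0)
    return p_y[y] * p_z[z] / joint[y, z]

def weighted_joint(y, z, weights):
    """Weighted empirical joint table over (y, z), normalized to sum to one."""
    y = np.asarray(y)
    z = np.asarray(z)
    table = np.zeros((y.max() + 1, z.max() + 1))
    np.add.at(table, (y, z), weights)
    return table / table.sum()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
# Hypothetical dependence: Z copies Y 90% of the time.
z = np.where(rng.random(5000) < 0.9, y, 1 - y)

w = joint_balancing_weights(y, z)
p_joint = weighted_joint(y, z, np.ones_like(w))
q_joint = weighted_joint(y, z, w)
print("P(y,z):\n", p_joint)
print("Q(y,z):\n", q_joint)
```

In this sketch the marginals of $Y$ and $Z$ are unchanged by the reweighting, while the joint over $(Y,Z)$ becomes the product of its marginals.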

In some cases, data balancing has proven to be an effective mitigation strategy for undesired dependencies, performing on par with other, more complex mitigation techniques [33]. Recently, data balancing has also shown promise for mitigation during fine-tuning or partial retraining [40, 43, 48, 78, 74], which is relevant to the settings of training large-scale models and with large amounts of data. Nevertheless, data balancing has also displayed failure modes in which the obtained models were not fair, robust or optimal [75, 47, 57, 2]. These failure modes have not been thoroughly characterized and can be difficult to predict. Furthermore, the impact of data balancing on other mitigation strategies has not been studied extensively.

Given data balancing’s popularity as a baseline mitigation strategy for undesired dependencies, we aim to formalize some of its promises and pitfalls. Our analysis relies on a causal graphical framework, which allows investigating the impact of data balancing in different data generating processes. Our contributions can be summarized as follows:

• We display failure modes of data balancing in semi-synthetic tasks and highlight how predicting these failures can be challenging.

• We introduce necessary and sufficient conditions for data balancing to attain invariance to undesired dependencies as defined by fairness or robustness criteria.

• We prove that data balancing does not correspond to ‘removing’ undesired dependencies from a causal perspective, and can negatively impact fairness or robustness criteria when combined with regularization strategies.

• We illustrate how our findings can be used to distinguish between failure modes and identify next steps.

2 Preliminaries

Let $X$ , $Y$ , $Z$ be random variables with ${X\in\mathcal{X}}$ corresponding to a set of covariates (e.g. tabular, images or text), $Y\in\mathcal{Y}$ to an outcome to be predicted, and $Z\in\mathcal{Z}$ to an auxiliary factor of variation, such as a sensitive attribute or the type of background of an image, that displays statistical dependence with $Y$ in the original, training distribution $P(X,Y,Z)$ . We consider a prediction model $f:\mathcal{X}\rightarrow\mathcal{Y}$ that is trained on data from distribution $P(X,Y,Z)$ to minimize the risk $R_{P}(f):=\operatorname{\mathbb{E}}_{X,Y\sim P}[\ell(f;X,Y)]$ where $\ell$ is a loss function. We call $f\in\mathcal{F}$ optimal on $P$ if the risk attains the minimum for $P$ .

Definition 2.1 (Optimality).

A prediction model $f\in\mathcal{F}$ is optimal on $P$ if $f=\arg\!\min_{f^{\prime}\in\mathcal{F}}R_{P}(f^{\prime})$ .

2.1 Desired criteria on a model’s predictions

While a model may be optimal on $P$ , it might not be optimal on another distribution of interest $P^{\prime}(X,Y,Z)$ (e.g. in deployment), and/or might display disparities across subsets of the data (e.g. $P(X,Y\,|\,Z=z)$ ) [22]. To mitigate this issue, multiple safety criteria have been defined in the fields of fairness and robustness.

Fairness: Fairness criteria can be defined in terms of the dependence between the model’s output $f(X)$ and the auxiliary factor of variation $Z$ . We consider established fairness criteria [5, 50], including demographic parity [ $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}Z$ , 23], equalized odds [ $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ , 29] and predictive parity [ $Y\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,f(X)$ , 24]. Beyond fairness of $f(X)$ , we also consider fairness of intermediate representations $\phi(X)$ , e.g. $\phi(X)\mathrel{\perp\mspace{-10.0mu}\perp}Z$ [80], for their usage in downstream tasks.
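For binary predictions, violations of demographic parity and equalized odds are often summarized as gaps in prediction rates across groups. The sketch below is one such summary under stated assumptions: the gap definitions, the function names and the toy arrays are illustrative choices and are not necessarily the exact metrics reported in our tables. Every $(y,z)$ cell is assumed to be non-empty.

```python
import numpy as np

def demographic_parity_gap(pred, z):
    """Largest difference in positive-prediction rate between groups of z."""
    pred = np.asarray(pred, dtype=float)
    z = np.asarray(z)
    rates = [pred[z == g].mean() for g in np.unique(z)]
    return max(rates) - min(rates)

def equalized_odds_gap(pred, y, z):
    """Largest between-group difference in prediction rate, taken within each label value."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y)
    z = np.asarray(z)
    gaps = []
    for label in np.unique(y):
        mask = y == label
        rates = [pred[mask & (z == g)].mean() for g in np.unique(z)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# Toy data in which every (y, z) cell contains two samples.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
z = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# A predictor that copies Y satisfies both criteria here,
# while a predictor that copies Z violates both maximally.
print(demographic_parity_gap(y, z), equalized_odds_gap(y, y, z))
print(demographic_parity_gap(z, z), equalized_odds_gap(z, y, z))
```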

Robustness: In this field, the focus is typically on finding models $f_{\theta}$ parameterized by $\theta\in\Theta$ that provide the lowest risk across a family of target distributions $\mathcal{P}$ . For instance, the ‘worst group performance’ criterion aims to select parameters such that the performance on a ‘worst’ distribution $P^{\prime}$ is optimized, i.e. $\theta^{*}=\arg\!\min_{\theta\in\Theta}\{\sup_{P^{\prime}\in\mathcal{P}}R_{P^{\prime}}(f_{\theta})\}$ [6, 20]. $\mathcal{P}$ can be defined so that each distribution $P^{\prime}$ represents a specific subpopulation [63], with the aim of minimizing the loss in each subgroup or of making $R_{P^{\prime}}$ invariant across subgroups [risk-invariance, 47].

Definition 2.2 (Risk-invariance).

A prediction model $f$ is risk-invariant w.r.t. the family of distributions $\mathcal{P}$ if $R_{P^{\prime}}(f)=R_{P^{\prime\prime}}(f)$ $\forall P^{\prime},P^{\prime\prime}\in\mathcal{P}$ .
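When each $P^{\prime}\in\mathcal{P}$ is taken to be a subgroup of an evaluation set, the worst-group risk and an approximate check of Definition 2.2 can be computed from per-sample losses. The following is a minimal sketch: the function names, the tolerance `tol` and the toy losses are assumptions for illustration, and exact equality of risks is relaxed to a tolerance because empirical risks are noisy.

```python
import numpy as np

def group_risks(losses, groups):
    """Average loss within each (integer-coded) group."""
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    return {int(g): float(losses[groups == g].mean()) for g in np.unique(groups)}

def worst_group_risk(risks):
    """Risk of the group on which the model performs worst."""
    return max(risks.values())

def is_risk_invariant(risks, tol=1e-2):
    """Approximate risk-invariance: all group risks agree up to a tolerance."""
    values = list(risks.values())
    return max(values) - min(values) <= tol

# Toy 0-1 losses for two groups of four samples each.
losses = [0, 0, 0, 1, 1, 1, 0, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

risks = group_risks(losses, groups)
print(risks, worst_group_risk(risks), is_risk_invariant(risks))
```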

If a model is optimal on $P$ and risk-invariant w.r.t. $\mathcal{P}$ , it is also optimal w.r.t. $\mathcal{P}$ . The choice of $\mathcal{P}$ is context-specific and reflects some domain knowledge about shifts that are likely to arise in a given application. For instance, a plausible family of target distributions could imply a shift in the dependence between $Y$ and $Z$ , also known as a correlation shift [61], and be expressed as $\mathcal{P}=\{P^{\prime}(X,Y,Z)=P(X\,|\,Y,Z)P^{\prime}(Z\,|\,Y)P(Y),\forall P^{\prime}(Z\,|\,Y)\}$ . Alternatively, we can define $\mathcal{P}$ using a causal framework (see Section 2.2) when the data generation process is known [47].
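A member of this correlation-shift family can be emulated from samples of $P$ by importance weighting with $P^{\prime}(Z\,|\,Y)/P(Z\,|\,Y)$ , which leaves $P(X\,|\,Y,Z)$ and $P(Y)$ untouched. The sketch below illustrates this under stated assumptions: the function names and toy data are hypothetical, and $Y$ and $Z$ are assumed to be integer-coded with every $(y,z)$ cell observed. Choosing $P^{\prime}(Z\,|\,Y)=P(Z)$ recovers the jointly balanced distribution of Definition 1.1.

```python
import numpy as np

def correlation_shift_weights(y, z, target_z_given_y):
    """Weights P'(z|y) / P(z|y); target_z_given_y has rows indexed by y, columns by z."""
    y = np.asarray(y)
    z = np.asarray(z)
    target = np.asarray(target_z_given_y, dtype=float)
    counts = np.zeros(target.shape)
    np.add.at(counts, (y, z), 1.0)
    p_z_given_y = counts / counts.sum(axis=1, keepdims=True)  # empirical P(z | y)
    return target[y, z] / p_z_given_y[y, z]

def weighted_table(y, z, weights, shape):
    """Weighted empirical joint over (y, z), normalized to sum to one."""
    table = np.zeros(shape)
    np.add.at(table, (np.asarray(y), np.asarray(z)), weights)
    return table / table.sum()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
# Hypothetical training dependence: Z copies Y 90% of the time.
z = np.where(rng.random(5000) < 0.9, y, 1 - y)

# Hypothetical target: the sign of the dependence is flipped.
target = np.array([[0.2, 0.8], [0.8, 0.2]])
w = correlation_shift_weights(y, z, target)

shifted = weighted_table(y, z, w, target.shape)
shifted_z_given_y = shifted / shifted.sum(axis=1, keepdims=True)
print(shifted_z_given_y)
```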

We acknowledge that selecting amongst those criteria is context-dependent and do not advocate for a specific choice. We call a prediction model $f$ invariant to undesired dependencies, denoted with $f\in\mathcal{F}_{inv}$ , if it satisfies one of these criteria. For brevity, we focus on risk-invariance in the main text and consider fairness criteria in the Appendix. Obtaining an invariant model can be performed in different ways, with data balancing being a popular approach.

2.2 Causal framework to analyse data balancing

To understand the effects of data balancing, we need to investigate its impact on the distribution $P$ . A causal formalization is useful for studying how distributions change under different interventions. To analyse the implications of data balancing, we use the framework of causal Bayesian networks (CBNs) [e.g. 70, 13, 51, 73, 25, 47]. A Bayesian network [54, 55, 15, 41] is a pair $\langle\mathcal{G},P\rangle$ , in which $\mathcal{G}$ is a directed acyclic graph whose nodes $X^{1},\ldots,X^{D}$ represent random variables and in which $P$ is a joint distribution over the nodes. The absence of edges in $\mathcal{G}$ implies a set of statistical independence assumptions satisfied by $P$ that can be expressed by the factorization $P(X^{1},\dots,X^{D})=\prod_{d=1}^{D}P(X^{d}\,|\,{\text{pa}}(X^{d}))$ , where ${\text{pa}}(X^{d})$ denotes the parents of $X^{d}$ , namely the nodes with an edge into $X^{d}$ (we say that $P$ factorizes according to $\mathcal{G}$ ). A CBN is a Bayesian network in which an edge expresses causal influence, so that ${\text{pa}}(X^{d})$ are direct causes of $X^{d}$ . A directed path between $X^{i}$ and $X^{j}$ in a CBN is also called a causal path. A non-directed path, also called non-causal path, expresses statistical dependence of non-causal nature. We refer to the statistical dependence between $X^{i}$ and $X^{j}$ that arises only due to the presence of non-causal paths as purely spurious. In our setting $X^{1}\cup\dots\cup X^{D}=X\cup Y\cup Z\cup\mathbf{U}$ where $\mathbf{U}$ are unobserved variables. Inspired by prior work [73, 3, 69, 76], we make the following assumption on the form of the covariates $X$ .

Figure 1: Examples of CBNs with undesired dependencies between $Y$ and $Z$ displayed by red edges. Light gray indicates unobserved variables. $X_{Y\wedge Z}=\emptyset$ in (a-b) and there is no entanglement between $Y$ and $Z$ via $X$ . In (c), we expand the system to include $V\in\mathbf{U}$ and its influence on $X$ , which is given by $X_{V}$ . Panel titles: (a) Anti-causal, purely spurious; (b) Causal, purely spurious; (d) Anti-causal, entangled data.

Assumption 2.3 (Form of Covariates $X$ ).

In the system defined by $X\cup Y\cup Z\cup U$ with $U\in\mathbf{U}$ , $X$ decomposes as $X=X^{\perp}_{Z}\cup X^{\perp}_{Y}\cup X_{Y\wedge Z}$ , where $X^{\perp}_{Z}$ is a function of $X$ that does not have causal paths to/from $Z$ but has causal paths to/from $Y$ , $X^{\perp}_{Y}$ is a function of $X$ that does not have causal paths to/from $Y$ but has causal paths to/from $Z$ , and $X_{Y\wedge Z}$ is a function of $X$ that has causal paths to/from both $Y$ and $Z$ , representing entangled signals.

In the animal classification example, $X^{\perp}_{Z}$ would correspond to the animal pixels, $X^{\perp}_{Y}$ to the background pixels (e.g. snowy or grassy landscape), and $X_{Y\wedge Z}$ to characteristics of the animal that depend on its environment (e.g. color of the fur pixels in bears). Intuitively, we want to build a prediction model $f$ that only depends on the animal pixels. While the decomposition may be readily available when a causal graph of the application is available and the data is tabular, we typically do not have direct access to the different functions of $X$ and these need to be isolated algorithmically.

Following Schölkopf et al. [65], we consider both the case in which $X^{\perp}_{Z}\cup X_{Y\wedge Z}$ are direct causes of the label $Y$ (causal task) e.g. estimating the helpfulness of a text review, and the case in which $Y$ is a direct cause of $X^{\perp}_{Z}\cup X_{Y\wedge Z}$ (anti-causal task) as in object detection tasks in computer vision. Figures 1(a-b) display examples of anti-causal and causal tasks with a purely spurious dependence between $Y$ and $Z$ . It is important to note that statistical relationships between the different variables and functions of $X$ are determined by the graph: for instance, in Figure 1(a) $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ , while in Figure 1(b) $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z$ .

Based on a CBN of the task and Assumption 2.3, we characterize undesired dependencies as the presence of undesired paths between $Z$ and $Y$ , which we indicate through red edges (Figure 1). Based on this depiction of undesired dependencies, we can define the family of target distributions $\mathcal{P}$ such that black edges are preserved, but those in red may lead to changes in the distribution. For the anti-causal task in Figure 1(a), we can hence write $\mathcal{P}=\{P^{\prime}(Y,Z,X)=P(Y)P^{\prime}(Z\,|\,Y)P(X^{\perp}_{Z}\,|\,Y)P(X^{\perp}_{Y}\,|\,Z)\}$ in which $P^{\prime}(Z\,|\,Y)$ represents any distribution but all other causal mechanisms are fixed [47], which corresponds to a correlation shift.
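The factorization above can be instantiated on a small discrete example to see what risk-invariance w.r.t. $\mathcal{P}$ means in practice. In the sketch below, all variables are binary and every probability table is a hypothetical choice made for illustration, as are the two toy predictors. A predictor that reads only $X^{\perp}_{Z}$ has the same 0-1 risk for every choice of $P^{\prime}(Z\,|\,Y)$ , whereas a predictor that reads $X^{\perp}_{Y}$ does not.

```python
import numpy as np

# Hypothetical mechanisms that stay fixed across the family (all variables binary).
p_y = np.array([0.5, 0.5])                              # P(Y)
p_xz_given_y = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(X_Z | Y), rows indexed by y
p_xy_given_z = np.array([[0.95, 0.05], [0.05, 0.95]])   # P(X_Y | Z), rows indexed by z

def joint(p_z_given_y):
    """P'(Y, Z, X_Z, X_Y) = P(Y) P'(Z|Y) P(X_Z|Y) P(X_Y|Z), with axes [y, z, x_z, x_y]."""
    return np.einsum("y,yz,ya,zb->yzab", p_y, p_z_given_y, p_xz_given_y, p_xy_given_z)

def zero_one_risk(joint_table, predictor):
    """Expected 0-1 loss of predictor(x_z, x_y) under the given joint table."""
    risk = 0.0
    for (y, _, a, b), prob in np.ndenumerate(joint_table):
        risk += prob * float(predictor(a, b) != y)
    return risk

# Three hypothetical members of the family, differing only in P'(Z | Y).
shifts = [
    np.array([[0.95, 0.05], [0.10, 0.90]]),
    np.array([[0.50, 0.50], [0.50, 0.50]]),
    np.array([[0.05, 0.95], [0.90, 0.10]]),
]

risks_using_xz = [zero_one_risk(joint(s), lambda a, b: a) for s in shifts]
risks_using_xy = [zero_one_risk(joint(s), lambda a, b: b) for s in shifts]
print("predictor on X_Z:", np.round(risks_using_xz, 4))
print("predictor on X_Y:", np.round(risks_using_xy, 4))
```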

3 Can we predict when data balancing fails?

As reported previously, data balancing can display failure modes, e.g. due to the presence of other confounders [75, 2], finite sampling effects [47] or a dependence between $Y$ and $Z$ when conditioning on $X$ ( $Y\not\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,X$ ) [57]. However, this list is non-exhaustive and, to the best of our knowledge, there is no unifying study of those failure modes or of how they could be mitigated. In this section, we perform joint data balancing on different tasks to illustrate that successes and failures of this approach can be difficult to predict. For details of the experiments, see Appendix D.

Let’s first consider semi-synthetic examples generated from the graphs in Figure 1(a,b), i.e. an anti-causal and causal task with a purely spurious correlation. We aim to obtain a risk-invariant and optimal model on these tasks by training on the jointly balanced distribution $Q$ .

Anti-causal task: number detection in MNIST. Inspired by Brown et al. [8], we modify MNIST images [44, 17] by adding a factor of variation $Z$ such that the top of the image is replaced by red noise for $Z=0$ and blue noise for $Z=1$ (Figure 2). We sample a dataset in which the factor of variation and label are dependent ( $P(Y=0\,|\,Z=0)=0.95$ , $P(Y=1\,|\,Z=0)=0.10$ , called the ‘confounded’ data), a jointly balanced dataset, and a dataset from a distribution $P^{0}$ in which the undesired dependency is absent ( $P^{0}(Z=0\,|\,Y)=0.5$ ). We train convolutional networks to predict whether the number in an image is smaller or larger than 5, assessing the models on their training distribution and on $P^{0}$ .

Models trained with confounded data (95/10) display biased outputs (Table 1), with low worst group performance and high equalized odds. Performance on $P^{0}$ is also lower compared to that on $P$ ( $0.937\pm 0.002$ ), showing that these models are not risk-invariant w.r.t. $\mathcal{P}$ . Models trained from balanced data obtain high overall performance and worst group accuracy, as well as low equalized odds. In addition, we were not able to decode $Z$ from the model representation $\phi(X)$ , suggesting that the model has not learned $X^{\perp}_{Y}$ (footnote 2: this result is interesting as an addition across the channels of the raw image suffices to discriminate red from blue samples, and colors can easily be discriminated by a model trained to predict $Z$ from scratch (accuracy=100%); we therefore show that the model is not performing any ‘incidental’ learning of $X^{\perp}_{Y}$ ). Our results suggest that data balancing led to a fair/robust and optimal model.

Causal task: helpfulness of reviews with Amazon reviews [52]. Inspired by Veitch et al. [73], we refer to the causal task of predicting the helpfulness rating of an Amazon review (thumbs up or down, $Y$ ) from its text ( $X$ ). We add a synthetic factor of variation $Z$ such that words like ‘the’ or ‘my’ are replaced by ‘thexxxx’ and ‘myxxxx’ ( $Z=0$ ) or ‘theyyyy’ and ‘myyyyy’ ( $Z=1$ ). We train a BERT [34] model on a class-balanced version of the data for reference (due to high class imbalance), and compare to a model trained on jointly balanced data, both evaluated on their training distribution and on a distribution $P^{0}$ with no association.

In this case, jointly balancing improves fairness and risk-invariance, with the model’s performance on the training distribution (acc.: $0.574\pm 0.016$ ) being similar to that on $P^{0}$ (Table 1). This however comes at a high performance cost when compared to the class balanced model’s performance on $P$ (acc.: $0.658\pm 0.015$ ). Therefore, data balancing might not lead to optimality for this causal task.

Figure 2: MNIST data samples.

Task	Dataset	Acc. ( $\uparrow$ )	Worst Grp ( $\uparrow$ )	Encoding ( $\sim 0.5$ )	Equ. Odds ( $\downarrow$ )
Anti-causal (a)	95/10	$0.717\pm 0.027$	$0.380\pm 0.062$	$0.996\pm 0.004$	$0.539\pm 0.015$
Anti-causal (a)	Balanced	$0.880\pm 0.006$	$0.836\pm 0.075$	$0.486\pm 0.005$	$0.018\pm 0.008$
Causal (b)	Class bal.	$0.558\pm 0.015$	$0.092\pm 0.015$	$0.690\pm 0.113$	$0.542\pm 0.098$
Causal (b)	Jointly bal.	$0.583\pm 0.017$	$0.399\pm 0.014$	$0.545\pm 0.037$	$0.060\pm 0.046$
Anti-causal (c)	With $V$	$0.769\pm 0.008$	$0.555\pm 0.031$	$0.665\pm 0.134$	$0.094\pm 0.035$
Anti-causal (d)	Entangled	$0.672\pm 0.004$	$0.000\pm 0.001$	$0.881\pm 0.223$	$0.554\pm 0.028$

Using the same framework, we can replicate the failure mode due to another confounder described in Wang et al. [75] and Alabdulmohsin et al. [2], as well as that from Puli et al. [57].

Anti-causal task with another factor of variation $V$ . It is common for multiple auxiliary factors to influence the data generating process, and they tend to correlate with each other [e.g. 21]. To emulate this case, we introduce more unobserved variables $U_{2},U_{3}$ as well as a factor of variation $V$ which affects the data through $X_{V}$ (Figure 1(c); footnote 3: $X_{V}$ and its dependencies on $(X,Y,Z)$ were selected to describe an example without entangled data, but the results hold for $X_{V}\subset X^{\perp}_{Y}$ ). We modify the MNIST data generation to include $X_{V}$ depicted by a green cross on the top left or top right of the image and jointly balance the data on $(Y,Z)$ before training the model. We evaluate the obtained predictor on a distribution where $V$ and $Z$ are not correlated with $Y$ and observe (Table 1) a large gap between worst group accuracy and overall performance, as well as non-null equalized odds. These results suggest that the model is not fair or robust, and also displays a decrease in performance compared to the model trained on data without $X_{V}$ .

Anti-causal task with entangled data. We map the work in Puli et al. [57] to our decomposition of $X$ and propose the example graph in Figure 1(d) where $X_{Y\wedge Z}$ represents an entangled function of $X$ . To match this data generating process, the color of the noise in MNIST samples is defined by $\textsc{OR}(Y,Z)$ and the evaluation distribution is the disentangled $P^{0}$ with no dependence between $Y$ and $Z$ . Once again, the obtained model is not fair, robust or optimal (Table 1). Appendix A.2 discusses this case further.

Motivated by these examples of both successes and failures, we define necessary and sufficient conditions for the success of data balancing, and highlight when the cases above fail to meet these conditions.

4 Conditions for data balancing to produce an invariant and optimal model

In this section, we introduce necessary and sufficient conditions that, taken together, lead to a risk-invariant and optimal prediction model $f$ after training on $Q$ (proofs in Appendix B.1). In Appendix B.2, we derive similar conditions for fairness criteria. Throughout the rest of the paper, we use a subscript to indicate under which of $P$ or $Q$ a statistical independence holds, e.g. $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z$ to indicate $P(Y\,|\,Z)=P(Y)$ .

We consider the criterion of risk-invariance (Definition 2.2) under correlation shift, i.e. $\mathcal{P}=\{P^{\prime}(X,Y,Z)=P(X|Y,Z)P^{\prime}(Z|Y)P(Y)\}$ . According to our decomposition of $X$ , the risk-minimizing function $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]$ should only be a function of $X^{\perp}_{Z}$ and not of $X^{\perp}_{Y}$ or $X_{Y\wedge Z}$ . To achieve this result with data balancing, we build on a prior result by Makar et al. [47], which shows that a model trained on a balanced distribution only depends on $X^{\perp}_{Z}$ if $X^{\perp}_{Z}$ represents a sufficient statistic for $Y$ , i.e. no other part of $X$ influences $Y$ .

Definition 4.1.

(Sufficient Statistic) We say that $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$ if $\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X^{\perp}_{Z}]$ .

Definition 4.1 implies that the risk-minimizing function $f$ for $Q$ does not vary with $X^{\perp}_{Y},X_{Y\wedge Z}$ . However, this condition is not sufficient on its own to ensure that $f$ is risk-invariant w.r.t. $\mathcal{P}$ , as $X^{\perp}_{Z}$ or $Y$ may have non-causal relationships with $Z$ . To ensure optimality and risk-invariance w.r.t. $\mathcal{P}$ , we derive the sufficient condition in Proposition 4.2.

Proposition 4.2.

If $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ and $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$ , then the risk-minimizer $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]$ is risk-invariant and optimal w.r.t. $\mathcal{P}$ .
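Both conditions of Proposition 4.2 can be checked numerically when all variables are discrete and the joint table is known. The sketch below does so for a purely spurious anti-causal example with the factorization $P(Y)P(Z\,|\,Y)P(X^{\perp}_{Z}\,|\,Y)P(X^{\perp}_{Y}\,|\,Z)$ and $X_{Y\wedge Z}=\emptyset$ . All probability tables and helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical binary mechanisms for a purely spurious anti-causal task.
p_y = np.array([0.5, 0.5])                              # P(Y)
p_z_given_y = np.array([[0.95, 0.05], [0.10, 0.90]])    # P(Z | Y), confounded
p_xz_given_y = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(X_Z | Y)
p_xy_given_z = np.array([[0.95, 0.05], [0.05, 0.95]])   # P(X_Y | Z)

# Joint P with axes [y, z, x_z, x_y].
joint_p = np.einsum("y,yz,ya,zb->yzab", p_y, p_z_given_y, p_xz_given_y, p_xy_given_z)

# Jointly balanced Q = P * P(Y)P(Z)/P(Y,Z), as in Definition 1.1.
p_yz = joint_p.sum(axis=(2, 3))
ratio = np.outer(p_yz.sum(axis=1), p_yz.sum(axis=0)) / p_yz
joint_q = joint_p * ratio[:, :, None, None]

def expected_y_given_x(joint_table):
    """E[Y | X_Z, X_Y] for binary Y, returned with axes [x_z, x_y]."""
    p_yx = joint_table.sum(axis=1)          # marginalize Z, axes [y, x_z, x_y]
    return p_yx[1] / p_yx.sum(axis=0)

def xz_given_yz(joint_table):
    """P(X_Z | Y, Z), returned with axes [y, z, x_z]."""
    p_yzx = joint_table.sum(axis=3)
    return p_yzx / p_yzx.sum(axis=2, keepdims=True)

e_p = expected_y_given_x(joint_p)
e_q = expected_y_given_x(joint_q)
print("E_P[Y | X]:\n", np.round(e_p, 3))    # varies with X_Y (columns differ)
print("E_Q[Y | X]:\n", np.round(e_q, 3))    # constant in X_Y (columns agree)
```

In this example the balanced conditional expectation no longer varies with $X^{\perp}_{Y}$ , and $X^{\perp}_{Z}$ remains independent of $Z$ given $Y$ under $Q$ , so both conditions hold.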

The conditions of Proposition 4.2 concern $Q$ . However, it would be of interest to express them in $P$ if it is possible to observe all covariates (e.g. in the case of tabular data). Based on our expression for $Q$ , we can derive sufficient conditions on $P$ , expressed in Corollary 4.3. Let’s denote $\{X^{\perp}_{Y},X_{Y\wedge Z}\}$ by $R$ .

Corollary 4.3.

If $R\mathrel{\perp\mspace{-10.0mu}\perp}_{P}\{Y,X^{\perp}_{Z}\}\,|\,Z$ and $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ , then the risk-minimizer $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]$ is risk-invariant and optimal w.r.t. $\mathcal{P}$ .

In general, we can expect that anti-causal tasks with purely spurious correlations will satisfy these conditions, as per their definition. However, this would not be the case for most causal tasks as $X^{\perp}_{Z}\centernot{\mathrel{\perp\mspace{-10.0mu}\perp}}_{P}Z\,|\,Y$ . This result is in line with our findings in Section 3, as the MNIST data generated from the graph in Figure 1(a) validates Corollary 4.3, but the Amazon reviews data generated from Figure 1(b) does not.

It may be less obvious, but the conditions for a sufficient statistic are not met in Figures 1(c,d) as $X_{V}\centernot{\mathrel{\perp\mspace{-10.0mu}\perp}}_{P}\{Y,X^{\perp}_{Z}\}\,|\,Z$ in the case of another factor of variation $V$ , and $X_{Y\wedge Z}\centernot{\mathrel{\perp\mspace{-10.0mu}\perp}}_{P}\{Y,X^{\perp}_{Z}\}\,|\,Z$ in the case of entangled data. We hence see that when a causal graph of the application is available, Corollary 4.3 can provide indicators on when data balancing might succeed or fail.

While Proposition 4.2 and its corollary provide conditions on the data generating process, prior work [e.g. 10, 31] has demonstrated that the learning strategy of $f$ also influences the model’s fairness and robustness characteristics. As data balancing on its own does not control the learning strategy, we need to define conditions on $f$ to ensure risk-invariance and optimality. To this end, we assume that the penultimate representation $\phi(X)$ can be decomposed into $\phi^{\perp}_{Z}(X)$ , $\phi^{\perp}_{Y}(X)$ and $\phi_{Y\wedge Z}(X)$ such that $\phi(X)$ is disentangled, i.e. $\mathbb{E}_{P^{\prime}}[Y\,|\,\phi^{\perp}_{Z}(X)]=\mathbb{E}_{P^{\prime}}[Y\,|\,X^{\perp}_{Z}]\ \forall P^{\prime}\in\mathcal{P}$ . We can define the following condition for risk-invariance and optimality of $f$ where $f$ is a linear transformation of $\phi(X)$ .

Proposition 4.4 (Disentangled representation).

Let $\phi(\cdot)$ be disentangled with $\mathbb{E}_{P^{\prime}}[Y\,|\,\phi^{\perp}_{Z}(X)]=\mathbb{E}_{P^{\prime}}[Y\,|\,X^{\perp}_{Z}]\ \forall P^{\prime}\in\mathcal{P}$ and $h$ be a linear function. The risk-minimizer $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]$ is optimal and risk-invariant w.r.t. $\mathcal{P}$ if $\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,f(X)]=\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(\phi^{\perp}_{Z}(X))]\ \forall P^{\prime}\in\mathcal{P}$ , $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$ and $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ .

In Proposition 4.4, we require that the representation $\phi(X)$ does not ‘lose’ information about $X^{\perp}_{Z}$ or mix it with information from $Z$ . We note that such a representation can be obtained even if the data is entangled, e.g. by dropping modes of variation during training. Unlike other strategies [4, 47, 57], data balancing cannot enforce this property on its own, so a disentangled representation is a necessary prerequisite. This condition hence suggests another failure mode of data balancing, in which the conditions on the data are satisfied but the representation is of low quality. We believe this failure mode is displayed in Kirichenko et al. [40], as the success of their data balancing mitigation only holds when using models pre-trained on large datasets.

In this section, we have identified conditions for data balancing to be successful. In the next section, we go one step further to understand how data balancing impacts the data generating process, and how it interacts with other mitigation strategies for undesired dependencies, focusing on regularization.

5 Impact of data balancing on the CBN

Joint data balancing is assumed to remove statistical dependence between $Y$ and $Z$ while keeping other relationships in the CBN of the task unaffected [e.g. 47, 76, 14]. This could be interpreted as ‘dropping’ edges in the undesired paths in $\mathcal{G}$ , e.g. removing the influence of $U$ on $Y$ and/or $Z$ in Figure 1(a), leading to a new graph $\mathcal{G}^{0}$ . While this interpretation is correct for joint balancing in the case of Figure 1(a), Proposition 5.1 below (proof in Appendix C) shows that it can be erroneous in general: the distribution $Q$ underlying the balanced data might not factorize according to $\mathcal{G}^{0}$ and therefore might not obey the statistical dependence relationships implied by $\mathcal{G}^{0}$ . Therefore, balancing data to make $Z$ and $Y$ statistically independent, i.e. selecting samples in proportion to $P(Z)P(Y)/P(Z,Y)$ , is not equivalent to generating data from a distribution that factorizes according to $\mathcal{G}^{0}$ in general. This factorization is important because downstream distributions $P^{\prime}(X,Y,Z)$ are often assumed to follow it; in fact, this assumption underlies a number of recommendations for applying regularization methodologies such as in [73].

Proposition 5.1.

Let $\langle\mathcal{G},P\rangle$ be the CBN underlying the data, where $\mathcal{G}$ contains an undesired path between $Z$ and $Y$ , and let $\mathcal{G}^{0}$ be a modification of $\mathcal{G}$ in which the undesired path has been removed. The distribution $Q$ obtained by jointly balancing the data need not factorize according to $\mathcal{G}^{0}$ .

Proposition 5.1 shows that statistical (in)dependencies that we assumed would remain fixed (i.e. the black edges on the graph) can be modified by the process of joint balancing. As a consequence, further interventions on $Q$ (e.g. the addition of a regularizer) should not be motivated by $\mathcal{G}^{0}$ , and we show below that combining data balancing with other mitigation strategies can lead to unexpected results.

5.1 Data balancing can hinder regularization and vice-versa

When confronted with a failure mode, it is reasonable to ask whether an additional fairness or robustness regularizer might be beneficial. Based on Proposition 5.1, we see that this question might have a different answer if we are in $P$ or in $Q$ . Below, we consider each failure mode and ask whether performing an additional regularization motivated by the literature would mitigate the undesired dependencies in $Q$ . In Appendix C.1.2, we discuss when balancing with regularization is sufficient for different fairness criteria.

Anti-causal task. In the case of an anti-causal task with a dependence between $Y$ and $Z$ (Figures 1(a,c,d)), Veitch et al. [73] recommend imposing independence between $f(X)$ and $Z$ conditioned on $Y$ . If we consider both the purely spurious correlation and the entangled case, we see that regularization and data balancing would have the same effect of blocking any dependence between $\{Y,X^{\perp}_{Z}\}$ and $\{Z,X^{\perp}_{Y},X_{Y\wedge Z}\}$ . We demonstrate that $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ in both $P$ and $Q$ (see Appendix C.1), and this regularization is sensible under both distributions. This means that performing the regularization provides the sufficient conditions for a risk-invariant model, whether or not joint data balancing is performed. In theory, data balancing is not needed but is also not harmful. In the case of an added confounder, we have that $X_{V}$ depends on both $Y$ and $Z$ due to non-causal paths through $V$ . Therefore, imposing that $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\mid Y$ might lead to results whereby the model only depends on $V$ or is trivial (e.g. predicts a constant) as the regularization encourages the removal of any dependence on $Z$ , which is related to $Y$ via $X_{V}$ . This behavior would be observed in both $P$ and $Q$ , but data balancing on its own might be less detrimental than regularization in terms of predictive power even though it does not resolve all undesired dependencies. In this case, regularization hinders data balancing.

Based on the balanced data from Section 3, we add a conditional Maximum Mean Discrepancy [MMD, 28] to encourage $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ during training, varying the strength of this regularizer via a hyper-parameter. In the case of the purely spurious statistical dependence between $Y$ and $Z$ (Figure 1(a)), there is little variation between the metrics across MMD strengths, and the model is fair and robust (Figure 3(left)). In the entangled case (Figure 3(right)), the model’s performance on $Q$ and $P^{0}$ are close for medium values of the hyper-parameter (before MMD overpowers the training) and worst group performance improves markedly. This result suggests that, with the added regularizer, $f$ only varies with $X^{\perp}_{Z}$ . Performing the same regularization in the presence of another confounder (Figure 3(middle)) leads to a plateau in performance on $Q$ , but low performance on $P^{0}$ and chance-level worst group performance. In this case, we posit that the model relies exclusively on $X_{V}$ for its predictions, and the regularizer is detrimental compared to data balancing on its own (MMD=0 on the plot).
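A conditional MMD penalty of this kind can be estimated by computing, for each value of $Y$ , the MMD between model outputs of the two $Z$ groups and summing the results. The sketch below uses a biased estimator with an RBF kernel in NumPy. It is not the regularizer implementation used in our experiments: the kernel, the `bandwidth` value, the function names and the toy features are assumptions, $Z$ is assumed binary and every $(y,z)$ cell non-empty.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2(a, b, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples a and b."""
    return (rbf_kernel(a, a, bandwidth).mean()
            + rbf_kernel(b, b, bandwidth).mean()
            - 2.0 * rbf_kernel(a, b, bandwidth).mean())

def conditional_mmd2(features, y, z, bandwidth=1.0):
    """Sum over label values of the squared MMD between the Z=0 and Z=1 groups."""
    total = 0.0
    for label in np.unique(y):
        mask = y == label
        total += mmd2(features[mask & (z == 0)], features[mask & (z == 1)], bandwidth)
    return total

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)          # independent of y, as after joint balancing
noise = 0.1 * rng.standard_normal(n)

features_invariant = (y + noise)[:, None]   # output varies with Y only
features_dependent = (z + noise)[:, None]   # output varies with Z

mmd_invariant = conditional_mmd2(features_invariant, y, z)
mmd_dependent = conditional_mmd2(features_dependent, y, z)
print(mmd_invariant, mmd_dependent)
```

In training, such a term would be added to the prediction loss and scaled by the regularization hyper-parameter, which requires a differentiable implementation in the chosen deep learning framework.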

Causal task. Finally, let us consider the causal task in Figure 1(b). In a similar case, Veitch et al. [73] suggest a regularizer such that $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z$ , which would encourage the model $f(X)$ to vary only with $X^{\perp}_{Z}$ as $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z$ . However, data balancing induces a dependence between $X^{\perp}_{Z}$ and $Z$ , as expressed below:

$\displaystyle\begin{aligned} Q(X^{\perp}_{Z}\,|\,Z)=\frac{\sum_{X^{\perp}_{Y},Y}P(X^{\perp}_{Z},X^{\perp}_{Y}\,|\,Z,Y)P(Z)P(Y)}{\sum_{X^{\perp}_{Y},X^{\perp}_{Z},Y}P(X^{\perp}_{Z},X^{\perp}_{Y}\,|\,Z,Y)P(Z)P(Y)}=\sum_{Y}P(X^{\perp}_{Z}\,|\,Z,Y)P(Y)\end{aligned}.$

The RHS cannot be simplified further since $X^{\perp}_{Z}\not\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\mid Y$ : $Y$ is a collider under $P$ , so conditioning on it opens the path between $X^{\perp}_{Z}$ and $Z$ . Thus, the left hand side is a function of $Z$ in general (see Appendix C.1 for further details and a numerical simulation). In this case, regularizing to enforce $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ would destroy information in $X^{\perp}_{Z}$ , whereas the same regularization under $P$ would have enabled $f(X)$ to use all of the information in $X^{\perp}_{Z}$ . Therefore, data balancing may hinder regularization.
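The induced dependence can be checked exactly on a small discrete example. The sketch below is not the simulation from Appendix C.1: it assumes a hypothetical three-variable collider graph $X^{\perp}_{Z}\rightarrow Y\leftarrow Z$ with binary variables and arbitrarily chosen probability tables, written `x`, `y` and `z` in the code. It computes $Q(X^{\perp}_{Z}\,|\,Z)=\sum_{Y}P(X^{\perp}_{Z}\,|\,Z,Y)P(Y)$ and compares it with $P(X^{\perp}_{Z}\,|\,Z)$ .

```python
from itertools import product

# Hypothetical probability tables for a collider x -> y <- z (all binary).
p_x = {0: 0.5, 1: 0.5}
p_z = {0: 0.7, 1: 0.3}
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}


def p_y_given_xz(y, x, z):
    p1 = p_y1_given_xz[(x, z)]
    return p1 if y == 1 else 1.0 - p1


# Joint distribution under P: x and z are marginally independent.
p_joint = {
    (x, y, z): p_x[x] * p_z[z] * p_y_given_xz(y, x, z)
    for x, y, z in product((0, 1), repeat=3)
}

# Marginals needed for the balanced distribution.
p_y = {y: sum(p_joint[(x, y, z)] for x in (0, 1) for z in (0, 1)) for y in (0, 1)}
p_yz = {(y, z): sum(p_joint[(x, y, z)] for x in (0, 1))
        for y in (0, 1) for z in (0, 1)}


def p_x_given_zy(x, z, y):
    return p_joint[(x, y, z)] / p_yz[(y, z)]


def q_x_given_z(x, z):
    """Balanced distribution: Q(x | z) = sum_y P(x | z, y) P(y)."""
    return sum(p_x_given_zy(x, z, y) * p_y[y] for y in (0, 1))


def p_x_given_z(x, z):
    """Original distribution: P(x | z), which equals P(x) by construction."""
    return sum(p_joint[(x, y, z)] for y in (0, 1)) / p_z[z]


for z in (0, 1):
    print(f"z={z}: P(x=1|z)={p_x_given_z(1, z):.3f}  Q(x=1|z)={q_x_given_z(1, z):.3f}")
```

Under these assumed tables, $P(X^{\perp}_{Z}\,|\,Z)$ does not change with $Z$ , while $Q(X^{\perp}_{Z}\,|\,Z)$ does, which is the dependence introduced by joint balancing.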

We illustrate this result on the Amazon reviews dataset from Section 3 by imposing a marginal MMD regularization $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}Z$ during training and evaluating risk-invariance across multiple $P^{\prime}\in\mathcal{P}$ . When training on $P$ , we observe that the regularization ‘flattens’ the curve, such that from medium to high values of MMD regularization, the model is risk-invariant (Figure 4(a)). On the jointly balanced data, medium values of the regularization degrade risk-invariance (see green curves on Figure 4(b)). Overall, model performance is also lower for the models trained on $Q$ than for models trained on $P$ across test sets from $P^{\prime}\in\mathcal{P}$ , at similar levels of regularization (see Figure 4(c) for MMD=16). This result shows that $X^{\perp}_{Z}$ is not a sufficient statistic for $Y$ in $Q$ .

6 Case study: distinguishing between failure modes in CelebA

In this section, we show that when $Y$ and $Z$ are available at training time, we can try to distinguish between failure modes of data balancing by using our different observations, even in the absence of a full causal graph. We illustrate this using the benchmark task of detecting blond hair in pictures of celebrities in the CelebA [45] dataset. This label has a strong correlation with perceived gender: half of the non-males have blond hair, while only $\sim 7\%$ of males do. We consider a balanced, subsampled dataset (train: $n=4,096$ , test/valid: $n=400$ ; these results were also replicated with a resampled dataset with $n=30,000$ for training) and the original, confounded dataset. We train a VGG [67] and four Vision Transformer [ViT, 18] architectures, with the number of parameters ranging from 17 to 690 million.
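One possible way to obtain a jointly balanced subsample, in which $Y$ and $Z$ are approximately independent while their marginals are preserved, is sketched below. This is an assumption about how such a subsample could be drawn, not a description of the procedure used for CelebA: the function name, the choice of target proportions $P(Y)P(Z)$ and the toy counts (which loosely mimic the blond hair rates quoted above) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def jointly_balanced_indices(y, z, seed=0):
    """Subsample indices so that cell counts are proportional to P(y) * P(z)."""
    n = len(y)
    p_y = {k: c / n for k, c in Counter(y).items()}
    p_z = {k: c / n for k, c in Counter(z).items()}

    # Group example indices by (label, auxiliary factor) cell.
    cells = defaultdict(list)
    for i, (yi, zi) in enumerate(zip(y, z)):
        cells[(yi, zi)].append(i)

    # Target proportion of each cell under independence of y and z.
    target = {(a, b): p_y[a] * p_z[b] for a in p_y for b in p_z}

    # Largest subsample size for which every cell has enough examples.
    size = min(len(cells[c]) / t for c, t in target.items())

    rng = random.Random(seed)
    keep = []
    for c, t in target.items():
        keep.extend(rng.sample(cells[c], int(size * t)))
    return sorted(keep)


# Toy confounded data: 50% positives when z=0, 7% positives when z=1.
y = [1] * 500 + [0] * 500 + [1] * 70 + [0] * 930
z = [0] * 1000 + [1] * 1000

idx = jointly_balanced_indices(y, z)
y_bal = [y[i] for i in idx]
z_bal = [z[i] for i in idx]


def positive_rate(ys, zs, group):
    sel = [yi for yi, zi in zip(ys, zs) if zi == group]
    return sum(sel) / len(sel)


rate_z0 = positive_rate(y_bal, z_bal, 0)
rate_z1 = positive_rate(y_bal, z_bal, 1)
print(f"subsample size: {len(idx)}, P(y=1|z=0)={rate_z0:.3f}, P(y=1|z=1)={rate_z1:.3f}")
```

The subsample size is limited by the rarest cell relative to its target proportion, which is why balanced subsets can be much smaller than the original data.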

We observe that, while training with balanced data leads to higher worst group accuracy and lower equalized odds scores than training with the historical data (Table 2), an important gap remains between the overall and worst group performances. These results show that data balancing leads to improvements in downstream fairness and robustness metrics, but does not provide a risk-invariant or fair model on its own. Therefore, it is likely that one of the conditions for data balancing to be sufficient is not fulfilled and understanding which condition is violated can guide our selection of another technique.

Distinguishing between failure modes. We first assume that the task is anti-causal. We then aim to understand whether there is another confounder, the data is entangled, or the representation is entangled (Proposition 4.4). As per Kirichenko et al. [40], we first attempt to improve our representation by pre-training the VGG with ImageNet [16]. While we observe an increase in performance with pre-training, there is no clear decrease in equalized odds. This result suggests that the failure may lie elsewhere. We then train models with MMD on $P$ , with the expectation that we would observe a plateau for entangled data when the model learns $f(X^{\perp}_{Z})$ , or a stark decrease in worst group performance in the presence of another confounder. While there is no major pattern of correlation between $Y$ and another attribute in the balanced data (see Appendix E.2.2), small effects might combine, or there might be other, unobserved attributes that influence $Y$ . For a medium value of the regularization hyper-parameter, the model displays a plateau in performance and poor worst group performance. This result suggests an effect of another confounder and next steps can include methods such as Alabdulmohsin et al. [2], which controls for all (observed) auxiliary factors of variation.

Model	Acc. ( $\uparrow$ )	Worst Grp ( $\uparrow$ )	Encoding ( $\sim 0.5$ )	Equ. Odds ( $\downarrow$ )
Original	$0.791\pm 0.037$	$0.314\pm 0.093$	$0.868\pm 0.015$	$0.243\pm 0.036$
Balanced	$0.839\pm 0.022$	$0.674\pm 0.088$	$0.709\pm 0.066$	$0.125\pm 0.022$
Pre-trained	$0.874\pm 0.006$	$0.726\pm 0.037$	$0.740\pm 0.033$	$0.111\pm 0.010$
MMD on $P$	$0.813\pm 0.036$	$0.146\pm 0.172$	$0.630\pm 0.010$	$0.001\pm 0.002$
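For readers who want to reproduce metrics of the kind reported in the table, the sketch below computes overall accuracy, worst group accuracy and an equalized odds gap for binary predictions. The exact definitions used in the experiments are not given in this section, so these are assumptions: groups are taken to be the $(Y,Z)$ cells, and the equalized odds gap is taken to be the mean over labels of the absolute difference in positive prediction rates between the two $Z$ groups. All names and the toy data are hypothetical.

```python
def accuracy(pred, y):
    """Fraction of correct predictions."""
    return sum(p == t for p, t in zip(pred, y)) / len(y)


def worst_group_accuracy(pred, y, z):
    """Minimum accuracy over the (label, auxiliary factor) cells."""
    groups = {}
    for p, t, g in zip(pred, y, z):
        correct, total = groups.get((t, g), (0, 0))
        groups[(t, g)] = (correct + (p == t), total + 1)
    return min(c / n for c, n in groups.values())


def equalized_odds_gap(pred, y, z):
    """Mean over labels of |P(pred=1 | Y=y, Z=0) - P(pred=1 | Y=y, Z=1)|."""
    gaps = []
    for label in (0, 1):
        rates = []
        for group in (0, 1):
            sel = [p for p, t, g in zip(pred, y, z) if t == label and g == group]
            rates.append(sum(sel) / len(sel))
        gaps.append(abs(rates[0] - rates[1]))
    return sum(gaps) / len(gaps)


# Toy predictions (hypothetical), two examples per (y, z) cell.
y = [1, 1, 1, 1, 0, 0, 0, 0]
z = [0, 0, 1, 1, 0, 0, 1, 1]
pred = [1, 1, 0, 1, 0, 1, 0, 0]

acc = accuracy(pred, y)
worst = worst_group_accuracy(pred, y, z)
eo_gap = equalized_odds_gap(pred, y, z)
print(acc, worst, eo_gap)
```

A large difference between overall and worst group accuracy, as in the Original and Balanced rows of the table, indicates that errors concentrate in specific $(Y,Z)$ cells.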

7 Related works

Balanced data as mitigation for invariant models. Our results extend those of Makar et al. [47], which considered a single causal graph. Wang et al. [75] showed that balancing data did not lead to a reduction in bias amplification. The authors posit that this failure of balanced data to correct for spurious signals is due to unobserved confounding factors, which is confirmed in Alabdulmohsin et al. [2]. Rolf et al. [62] investigated upsampling by relying on a scaling law per group, focusing on the question of fairness vs performance trade-off [22]. Focusing on causal NLP settings, Joshi et al. [36] investigated causal and non-causal features, concluding that data balancing does not help in all cases. Closer to our work is that of Puli et al. [57], in which the authors showed that having $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ does not imply that $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,X$ and the model can learn signals related to $Z$ . Puli et al. [57] propose a method to learn a representation $r$ such that $Y\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,r(X)$ . Our work provides a framework to understand these different failure modes and proposes strategies to distinguish between them. While we focus on pre-processing mitigation with a fixed distribution $Q(X,Y,Z)$ , another line of work considers dynamic resampling in-processing [e.g. 35, 60, 12]. As the resampling converges towards a fixed distribution $P^{\prime}(Z|Y)$ , we would expect failure modes in the presence of entangled data or of another confounder. Nevertheless, the variation in $P^{\prime}(Z|Y)$ at the early stages of training might be beneficial, e.g. by disentangling the representation. We leave this investigation for future work.

Causal feature selection. Some works have used a causal framing to select features such that $f(X)$ has robustness and/or fairness properties [e.g. 46, 70, 68, 25, 66]. Similarly, our work defines independence conditions on covariates to obtain an optimal, invariant model, and can be used to select features. Two major distinctions between feature selection works and ours reside in the fact that we consider the case in which we do not observe $X^{\perp}_{Z}$ explicitly and that we investigate the impact of data balancing.

8 Discussion

In this work, we uncover important results to guide the use of data balancing for mitigating undesired dependencies between covariates, outcomes and auxiliary factors of variation. We first show (Section 3) that joint data balancing might not achieve the desired fairness or robustness criteria, and that the failures may seem difficult to predict. Motivated by these results, we introduce conditions under which data balancing leads to a robust or fair model (Sections 4, B.2). Importantly, we show that data balancing is not equivalent to ‘dropping an edge’ in the causal graph and can lead to distributions that do not factorize according to the desired graph (Section 5). This can have downstream consequences if further mitigation strategies are motivated by the causal graph and highlights why regularization and data balancing might not go ‘hand in hand’. This last result shows that data balancing should not be performed as a ‘default’, and mitigation strategies should be based on the causal graph of the application. Finally, even in the absence of a causal graph, our findings may help to pinpoint which condition(s) are not fulfilled, and guide further mitigation (Section 6).

Limitations. The conditions defined in Section 4 for risk-invariance depend on the expression of $\mathcal{P}$ as a correlation shift [47, 61]. Other expressions are likely to lead to other conditions. In our experiments, we have mostly subsampled datasets to obtain balanced distributions. We would expect similar results for other joint balancing methods. Variations are, however, possible due to the finite-set nature of the computations [47], e.g. with reweighting displaying more variance [33], potentially under-performing in overparametrized settings [11, 64]. We also note that, while we aimed to provide upper bounds for the effectiveness of data balancing, we did not use additional training strategies for mitigation beyond regularization. We believe that our causal framework can be a useful tool to analyze other pre- or in-processing methods that enforce independence between variables in the data generating process [e.g. 1, 57]. On the other hand, our framework might not be suited to analyze the effects of other mitigation strategies, e.g. hyper-parameter optimization [56].

Future work. This work considered a variety of causal graphs in order to provide general insights rather than task-specific conditions. However, investigating specific graphs could make it possible to leverage further strategies, including other balancing techniques [e.g. 71]. We believe that our causal framing could then be a useful resource to analyze the effect of these strategies on downstream fairness and robustness criteria. Finally, we illustrate our propositions with binary classification tasks and confounders. While our reasoning applies to more complex settings, there might be further considerations to account for when generalizing beyond binary variables, especially with respect to estimation.

Broader impact

Our work investigates a common mitigation strategy for failures of fairness or robustness in machine learning predictive settings. We aim to clearly highlight when data balancing is promising, and when it fails, hence advancing the field of trustworthy machine learning. As with most papers addressing fairness questions, we acknowledge that our mathematical formulations of fairness criteria might not correspond to the desired societal impact, e.g. in terms of equity. Specific considerations for our work include the use of the CelebA [45] dataset, and in particular the ‘is-male’ binary label provided. We acknowledge that a binary characterization of gender is not representative and can be harmful. In addition, it would be desirable to have self-reported instead of perceived gender. Our work considers cases for which auxiliary factors of variation $Z$ are observed at train, test or fine-tuning time. This is a limitation of our investigation, as our insights might not be available when $Z$ is unobserved. This is exemplified by the more difficult case of distinguishing between failure modes without a $P^{0}$ in the classification of CelebA images.

Acknowledgments and Disclosure of Funding

We thank Virginia Aglietti for feedback on this work and Victor Veitch for sharing experimental code for the Amazon reviews experiments. This work was funded by Google DeepMind.

References

Alabdulmohsin & Lučić [2021] Alabdulmohsin, I. and Lučić, M. A near-optimal algorithm for debiasing trained machine learning models. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=H5TBqNFPKSJ.
Alabdulmohsin et al. [2024] Alabdulmohsin, I., Wang, X., Steiner, A., Goyal, P., D’Amour, A., and Zhai, X. CLIP the bias: How useful is balancing data in multimodal learning? In International Conference on Learning Representations, 2024.
Anthis & Veitch [2023] Anthis, J. R. and Veitch, V. Causal context connects counterfactual fairness to robust prediction and group fairness. In Advances in Neural Information Processing Systems, volume 37, 2023. URL https://openreview.net/forum?id=AmwgBjXqc3.
Arjovsky et al. [2019] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2019. Preprint 1907.02893. URL http://arxiv.longhoe.net/abs/1907.02893.
Barocas et al. [2023] Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.
Ben-Tal et al. [2013] Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Manage. Sci., 59(2):341–357, 2013.
Bradbury et al. [2018] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
Brown et al. [2023] Brown, A., Tomasev, N., Freyberg, J., Liu, Y., Karthikesalingam, A., and Schrouff, J. Detecting shortcut learning for fair medical AI using shortcut testing. Nat. Commun., 14(1):4314, 2023.
Byrd & Lipton [2019] Byrd, J. and Lipton, Z. What is the effect of importance weighting in deep learning? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 872–881. PMLR, 2019.
Carlini & Wagner [2017] Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
Celis et al. [2018] Celis, E., Keswani, V., Straszak, D., Deshpande, A., Kathuria, T., and Vishnoi, N. Fair and diverse DPP-based data summarization. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 716–725. PMLR, 2018. URL https://proceedings.mlr.press/v80/celis18a.html.
Chen et al. [2023] Chen, X., Fan, W., Chen, J., Liu, H., Liu, Z., Zhang, Z., and Li, Q. Fairly adaptive negative sampling for recommendations. In Proceedings of the ACM Web Conference 2023, WWW ’23, pp. 3723–3733, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394161. doi: 10.1145/3543507.3583355. URL https://doi.org/10.1145/3543507.3583355.
Chiappa [2019] Chiappa, S. Path-Specific counterfactual fairness. AAAI, 33(01):7801–7808, 2019.
Compton et al. [2023] Compton, R., Zhang, L., Puli, A., and Ranganath, R. When more is less: Incorporating additional datasets can hurt performance by introducing spurious correlations, 2023. Preprint 2308.04431. URL http://arxiv.longhoe.net/abs/2308.04431.
Cowell et al. [2007] Cowell, R. G., Dawid, A. P., Lauritzen, S., and Spiegelhalter, D. J. Probabilistic Networks and Expert Systems, Exact Computational Methods for Bayesian Networks. Springer-Verlag, 2007.
Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
Deng [2012] Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
Drenkow et al. [2021] Drenkow, N., Sani, N., Shpitser, I., and Unberath, M. A systematic review of robustness in deep learning for computer vision: Mind the gap?, 2021. Preprint 2112.00639. URL http://arxiv.longhoe.net/abs/2112.00639.
Duchi et al. [2016] Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach, 2016. Preprint 1610.03425. URL http://arxiv.longhoe.net/abs/1610.03425.
Duffy et al. [2022] Duffy, G., Clarke, S. L., Christensen, M., He, B., Yuan, N., Cheng, S., and Ouyang, D. Confounders mediate AI prediction of demographics in medical imaging. NPJ Digit Med, 5(1):188, 2022.
Dutta et al. [2020] Dutta, S., Wei, D., Yueksel, H., Chen, P.-Y., Liu, S., and Varshney, K. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2803–2813. PMLR, 2020. URL https://proceedings.mlr.press/v119/dutta20a.html.
Dwork et al. [2012] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, pp. 214–226, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450311151. doi: 10.1145/2090236.2090255. URL https://doi.org/10.1145/2090236.2090255.
Flores et al. [2016] Flores, A. W., Bechtel, K., and Lowenkamp, C. T. False positives, false negatives, and false analyses: A rejoinder to “machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks.”. Fed. Probat., 80(2), 2016.
Galhotra et al. [2022] Galhotra, S., Shanmugam, K., Sattigeri, P., and Varshney, K. R. Causal feature selection for algorithmic fairness. In Proceedings of the 2022 International Conference on Management of Data, SIGMOD ’22, pp. 276–285, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392495. doi: 10.1145/3514221.3517909. URL https://doi.org/10.1145/3514221.3517909.
Geirhos et al. [2019] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.
Gichoya et al. [2022] Gichoya, J. W., Banerjee, I., Bhimireddy, A. R., Burns, J. L., Celi, L. A., Chen, L.-C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.-C., Kuo, P.-C., Lungren, M. P., Palmer, L. J., Price, B. J., Purkayastha, S., Pyrros, A. T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., and Zhang, H. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health, 4(6):e406–e414, 2022.
Gretton et al. [2012] Gretton, A., Borgwardt, K. M., Rasch, M. J., and Scholkopf, B. A kernel Two-Sample test. J. Mach. Learn. Res., 13(25):723–773, 2012.
Hardt et al. [2016] Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
Harris et al. [2020] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
Hooker et al. [2020] Hooker, S., Moorosi, N., Clark, G., Bengio, S., and Denton, E. Characterising bias in compressed models, 2020. Preprint 2010.03058. URL http://arxiv.longhoe.net/abs/2010.03058.
Hunter [2007] Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng., 9(3):90–95, 2007.
Idrissi et al. [2022] Idrissi, B. Y., Arjovsky, M., Pezeshki, M., and Lopez-Paz, D. Simple data balancing achieves competitive worst-group-accuracy. In Schölkopf, B., Uhler, C., and Zhang, K. (eds.), Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pp. 336–351. PMLR, 2022. URL https://proceedings.mlr.press/v177/idrissi22a.html.
Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), volume 1, pp. 2, 2019.
Jiang & Nachum [2020] Jiang, H. and Nachum, O. Identifying and correcting label bias in machine learning. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 702–712. PMLR, 2020. URL https://proceedings.mlr.press/v108/jiang20a.html.
Joshi et al. [2022] Joshi, N., Pan, X., and He, H. Are all spurious features in natural language alike? an analysis through a causal lens. In Empirical Methods in Natural Language Processing (EMNLP), 2022.
Kamiran & Calders [2012] Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst., 33(1):1–33, 2012.
Kehrenberg et al. [2020] Kehrenberg, T., Chen, Z., and Quadrianto, N. Tuning fairness by balancing target labels. Front Artif Intell, 3:33, 2020.
Kim et al. [2023] Kim, D., Park, S., Hwang, S., and Byun, H. Fair classification by loss balancing via fairness-aware batch sampling. Neurocomputing, 518:231–241, 2023.
Kirichenko et al. [2022] Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correlations, 2022. Preprint 2204.02937. URL http://arxiv.longhoe.net/abs/2204.02937.
Koller & Friedman [2009] Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
LaBonte et al. [2023] LaBonte, T., Muthukumar, V., and Kumar, A. Towards last-layer retraining for group robustness with fewer annotations, 2023. Preprint 2309.08534. URL http://arxiv.longhoe.net/abs/2309.08534.
Lecun et al. [1998] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
Liu et al. [2015] Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015.
Magliacane et al. [2018] Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
Makar et al. [2022] Makar, M., Packer, B., Moldovan, D., Blalock, D., Halpern, Y., and D’Amour, A. Causally motivated shortcut removal using auxiliary labels. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 739–766. PMLR, 2022. URL https://proceedings.mlr.press/v151/makar22a.html.
Mao et al. [2023] Mao, Y., Deng, Z., Yao, H., Ye, T., Kawaguchi, K., and Zou, J. Last-layer fairness fine-tuning is simple and effective for neural networks, 2023. Preprint 2304.03935. URL http://arxiv.longhoe.net/abs/2304.03935.
McKinney [2010] McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. SciPy, 2010.
Mehrabi et al. [2021] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv., 54(6):1–35, 2021.
Mooij et al. [2020] Mooij, J. M., Magliacane, S., and Claassen, T. Joint causal inference from multiple contexts. J. Mach. Learn. Res., 21(99):1–108, 2020.
Ni et al. [2019] Ni, J., Li, J., and McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 188–197, 2019.
Obermeyer et al. [2019] Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
Pearl [1988] Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.
Pearl [2000] Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
Perrone et al. [2021] Perrone, V., Donini, M., Zafar, M. B., Schmucker, R., Kenthapadi, K., and Archambeau, C. Fair bayesian optimization. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 854–863, 2021.
Puli et al. [2022] Puli, A. M., Zhang, L. H., Oermann, E. K., and Ranganath, R. Out-of-distribution generalization in the presence of nuisance-induced spurious correlations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=12RoR2o32T.
Quinonero-Candela et al. [2022] Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.). Dataset shift in machine learning. Neural Information Processing series. MIT Press, London, England, 2022.
Rančić et al. [2021] Rančić, S., Radovanović, S., and Delibašić, B. Investigating oversampling techniques for fair machine learning models. In Decision Support Systems XI: Decision Support Systems, Analytics and Technologies in Response to Global Crisis Management, pp. 110–123. Springer International Publishing, 2021.
Roh et al. [2021] Roh, Y., Lee, K., Whang, S. E., and Suh, C. Fairbatch: Batch selection for model fairness. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YNnpaAKeCfx.
Roh et al. [2023] Roh, Y., Lee, K., Whang, S. E., and Suh, C. Improving fair training under correlation shifts. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 29179–29209. PMLR, 2023. URL https://proceedings.mlr.press/v202/roh23a.html.
Rolf et al. [2021] Rolf, E., Worledge, T. T., Recht, B., and Jordan, M. Representation matters: Assessing the importance of subgroup allocations in training data. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9040–9051. PMLR, 2021. URL https://proceedings.mlr.press/v139/rolf21a.html.
Sagawa* et al. [2020] Sagawa*, S., Koh*, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
Sagawa et al. [2020] Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8346–8356. PMLR, 2020. URL https://proceedings.mlr.press/v119/sagawa20a.html.
Schölkopf et al. [2012] Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In International Conference on Machine Learning, pp. 459–466, 2012.
Schrouff et al. [2022] Schrouff, J., Harris, N., Koyejo, S., Alabdulmohsin, I. M., Schnider, E., Opsahl-Ong, K., Brown, A., Roy, S., Mincu, D., Chen, C., Dieng, A., Liu, Y., Natarajan, V., Karthikesalingam, A., Heller, K. A., Chiappa, S., and D’Amour, A. Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 19304–19318. Curran Associates, Inc., 2022.
Simonyan & Zisserman [2015] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
Singh et al. [2021] Singh, H., Singh, R., Mhasawade, V., and Chunara, R. Fairness violations and mitigation under covariate shift. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 3–13. Association for Computing Machinery, New York, NY, USA, 2021.
Sreekumar & Boddeti [2023] Sreekumar, G. and Boddeti, V. N. Spurious correlations and where to find them, 2023. Preprint 2308.11043. URL http://arxiv.longhoe.net/abs/2308.11043.
Subbaswamy & Saria [2018] Subbaswamy, A. and Saria, S. Counterfactual normalization: Proactively addressing dataset shift using causal mechanisms. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp. 947–957. Association For Uncertainty in Artificial Intelligence (AUAI), 2018.
Sun et al. [2023] Sun, Q., Murphy, K., Ebrahimi, S., and D’Amour, A. Beyond invariance: Test-time label-shift adaptation for distributions with "spurious" correlations, 2023. Preprint 2211.15646. URL http://arxiv.longhoe.net/abs/2211.15646.
Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. Training data-efficient image transformers & distillation through attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10347–10357. PMLR, 2021.
Veitch et al. [2021] Veitch, V., D’Amour, A., Yadlowsky, S., and Eisenstein, J. Counterfactual invariance to spurious correlations in text classification. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=BdKxQp0iBi8.
Wang & Russakovsky [2023] Wang, A. and Russakovsky, O. Overcoming bias in pretrained models by manipulating the finetuning dataset, 2023. Preprint 2303.06167. URL http://arxiv.longhoe.net/abs/2303.06167.
Wang et al. [2019] Wang, T., Zhao, J., Yatskar, M., Chang, K., and Ordonez, V. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5309–5318, Los Alamitos, CA, USA, 2019. IEEE Computer Society. doi: 10.1109/ICCV.2019.00541. URL https://doi.ieeecomputersociety.org/10.1109/ICCV.2019.00541.
Wu et al. [2023] Wu, S., Yuksekgonul, M., Zhang, L., and Zou, J. Discover and cure: concept-aware mitigation of spurious correlation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
Yan et al. [2020] Yan, S., Kao, H.-T., and Ferrara, E. Fair class balancing: Enhancing model fairness without observing sensitive attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pp. 1715–1724, New York, NY, USA, 2020. Association for Computing Machinery.
Yang et al. [2023a] Yang, Y., Nushi, B., Palangi, H., and Mirzasoleiman, B. Mitigating spurious correlations in multi-modal models during fine-tuning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 39365–39379. PMLR, 2023a. URL https://proceedings.mlr.press/v202/yang23j.html.
Yang et al. [2023b] Yang, Y., Zhang, H., Gichoya, J. W., Katabi, D., and Ghassemi, M. The limits of fair medical imaging ai in the wild, 2023b. Preprint 2312.10083. URL http://arxiv.longhoe.net/abs/2312.10083.
Zemel et al. [2013] Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 325–333, Atlanta, Georgia, USA, 2013. PMLR. URL https://proceedings.mlr.press/v28/zemel13.html.

Appendix A Failure modes of data balancing

A.1 Failure mode: Balancing on one variable can increase bias

It is common to consider balancing on classes or groups as it requires fewer labels than joint balancing. However, without further intervention, class or group balancing on its own does not provide an invariant model when $Y$ and $Z$ are marginally dependent [e.g. 43]. In Figure 1(a), this means that $X^{\perp}_{Z}\centernot{\mathrel{\perp\mspace{-10.0mu}\perp}}_{Q}Z\,|\,Y$, invalidating Prop. 4.2. Below, we formalize the observation in Yan et al. [77] that balancing on one variable might affect the representation of the other, and provide bounds on the impact of this strategy.

Formalization and proof.

We formalize this issue in Proposition A.1 for the binary case with a binary attribute.

Proposition A.1.

Consider data balancing of $Y$ ; the marginal of $Z$ will be farther from uniform than the marginal of $Z$ before balancing if

\mathrm{sgn}\left(\frac{P(Z=1)-\frac{1}{2}}{P(Y=1)-\frac{1}{2}}\right)=\mathrm{sgn}\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right).

Intuitively, if the biases of $Y$ and $Z$ are in the same (resp. opposite) direction, then this condition is satisfied if $Z$ has a negative (resp. positive) correlation with $Y$. For example, if we have $P(Y=1)=\frac{1}{4}$, $\operatorname{\mathbb{E}}[Z\,|\,Y=1]=1$ and $\operatorname{\mathbb{E}}[Z\,|\,Y=0]=\frac{1}{3}$, then $\operatorname{\mathbb{E}}[Z]=\frac{1}{2}$ before balancing but $\operatorname{\mathbb{E}}[Z]=\frac{2}{3}$ after balancing.

Proof of Proposition A.1.

We assume that $Y$ and $Z$ , representing the label and confounder, are both binary. We will data-balance on $Y$ . Let $Z\,|\,S$ denote the distribution of $Z$ after data balancing. To characterize when the distribution of $Z\,|\,S$ is farther from uniform than the distribution of $Z$ , we will first derive

\operatorname{\mathbb{E}}[Z]-\frac{1}{2}=P(Y=1)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\frac{1}{2}\right)+P(Y=0)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\frac{1}{2}\right)

and

\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}=\frac{1}{2}\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\frac{1}{2}\right)+\frac{1}{2}\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\frac{1}{2}\right).

Now, taking the difference, we have

\begin{aligned}\operatorname{\mathbb{E}}[Z]-\frac{1}{2}&=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\frac{1}{2}\right)+\left(P(Y=0)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\frac{1}{2}\right)\\ &=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\operatorname{\mathbb{E}}[Z\,|\,Y=1]+\left(P(Y=0)-\frac{1}{2}\right)\operatorname{\mathbb{E}}[Z\,|\,Y=0]\\ &=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\operatorname{\mathbb{E}}[Z\,|\,Y=0]\right).\end{aligned}

We can derive some sufficient conditions for bias increase, which occurs when $|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}|\geq|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}|$. We proceed by cases. If $\operatorname{\mathbb{E}}[Z]-\frac{1}{2}>0$, then rearranging the last identity gives

\begin{aligned}\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}&=\operatorname{\mathbb{E}}[Z]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right)\\ &=\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right),\end{aligned}

so, whenever the last term is positive, $\left|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}\right|=\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right)$. Thus, the bias gets worse if $\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right)>0$.

Similar reasoning shows that if $\operatorname{\mathbb{E}}[Z]-\frac{1}{2}<0$, then, whenever the last term is negative,

\left|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}\right|=\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|-\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right),

and we can conclude that the bias is worsened if $\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\operatorname{\mathbb{E}}[Z\,|\,Y=1]\right)<0$. Taking both statements together, we obtain the statement of the proposition. ∎

For example, if we have $P(Y=1)=\frac{1}{4}$ , $\operatorname{\mathbb{E}}[Z\,|\,Y=1]=1$ and $\operatorname{\mathbb{E}}[Z\,|\,Y=0]=\frac{1}{3}$ , then $\operatorname{\mathbb{E}}[Z]=\frac{1}{2}$ but $\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}=\frac{1}{6}$ ; despite $Z$ starting as unbiased, the data balancing induces a bias of $\frac{1}{6}$ .

There are a few implications of this derivation. First, we obtain an easy upper bound for the worsening of the bias of $Z$ caused by data balancing: taking absolute values of both sides of the identity relating $\operatorname{\mathbb{E}}[Z]$ and $\operatorname{\mathbb{E}}[Z\,|\,S]$ and using the triangle inequality on the right yields

\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|\leq\left|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}\right|+\left|P(Y=1)-\frac{1}{2}\right|\left|\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\operatorname{\mathbb{E}}[Z\,|\,Y=0]\right|.

Bringing the second term over to the left hand side and applying the same logic produces

\left|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}\right|\leq\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|+\left|P(Y=1)-\frac{1}{2}\right|\left|\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\operatorname{\mathbb{E}}[Z\,|\,Y=0]\right|,

and combining both terms shows that the difference in bias of $Z$ and $Z\,|\,S$ is bounded by

\left|\left|\operatorname{\mathbb{E}}[Z]-\frac{1}{2}\right|-\left|\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}\right|\right|\leq\left|P(Y=1)-\frac{1}{2}\right|\left|\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\operatorname{\mathbb{E}}[Z\,|\,Y=0]\right|.
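The identity and the bound can be checked on the worked example with exact arithmetic. The following is a minimal sketch; the helper name `z_bias_before_after` is hypothetical and only the binary case with balancing on $Y$ is covered.

```python
from fractions import Fraction as F

HALF = F(1, 2)

def z_bias_before_after(p_y1, ez_y1, ez_y0):
    """Return (E[Z] - 1/2, E[Z|S] - 1/2) for binary Y and Z when balancing on Y.

    p_y1  : P(Y=1) in the original data
    ez_y1 : E[Z | Y=1]
    ez_y0 : E[Z | Y=0]
    """
    before = p_y1 * ez_y1 + (1 - p_y1) * ez_y0 - HALF
    # After balancing on Y, both classes receive weight 1/2.
    after = HALF * ez_y1 + HALF * ez_y0 - HALF
    return before, after

# Values of the worked example in the text.
p_y1, ez_y1, ez_y0 = F(1, 4), F(1), F(1, 3)
before, after = z_bias_before_after(p_y1, ez_y1, ez_y0)

# Correction term appearing in the identity derived in the proof.
gap = (p_y1 - HALF) * (ez_y1 - ez_y0)
# Right-hand side of the final bound.
bound = abs(p_y1 - HALF) * abs(ez_y1 - ez_y0)

print("E[Z] - 1/2   =", before)
print("E[Z|S] - 1/2 =", after)
print("identity holds:", before == after + gap)
print("bound holds:   ", abs(abs(before) - abs(after)) <= bound)
```

Exact fractions are used so that the equality in the identity can be tested without floating-point tolerance.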

Simulation.

We present a simple simulation to illustrate our reasoning: $U\sim\mathcal{N}(0,0.1)$ is a common cause of $Z$ and $Y$. More specifically, the continuous distributions of $Y$ and $Z$ both have the form $U+\epsilon$, with $\epsilon\sim\mathcal{N}(0.05,0.02)$. We then binarize $Y$ by thresholding at 0. This creates an imbalance in the marginal of $Y$, such that a random sample of 5,000 examples has $\sim 68\%$ of positive labels. We then want to vary the marginal of $Z$, which also requires affecting their correlation. To this end, we vary the threshold for binarizing $Z$. This leads us to two main cases: for thresholds above 0 (i.e. $Y$’s threshold), the marginal of $Z$ is imbalanced in the same direction as that of $Y$. For thresholds smaller than 0, we obtain the opposite, i.e. if $Y=1$ is over-represented, $Z=1$ is under-represented.

We illustrate these 2 cases in Figure 6. We observe that when the marginals are similar, balancing $Y$ brings $Z$ closer to a uniform distribution (top row). However, the marginal distribution of $Z$ becomes more imbalanced after balancing on $Y$ if the two distributions are reversed (bottom row). When the correlation is small, there is little change in the marginal of $Z$ when balancing on $Y$ , which is expected.

For completeness, we perform 200 simulations with different thresholdings for $Z$ and present the results in Figure 7.
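A simulation of this kind can be sketched as follows. This is not the code behind Figures 6 and 7: it assumes that the second argument of $\mathcal{N}(\cdot,\cdot)$ is a standard deviation, that $Y$ and $Z$ use independent noise draws, and that a variable is set to 1 when its continuous value exceeds the threshold. The function name `z_marginal_before_after` and the threshold grid are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Common cause U and the continuous versions of Y and Z (U + noise).
# Assumption: the second parameter of N(., .) is a standard deviation.
u = rng.normal(0.0, 0.1, size=n)
y_cont = u + rng.normal(0.05, 0.02, size=n)
z_cont = u + rng.normal(0.05, 0.02, size=n)

# Binarize Y at 0 (assumed convention: 1 if above the threshold).
y = (y_cont > 0).astype(int)

def z_marginal_before_after(threshold):
    """Return (P(Z=1), P(Z=1 | S)) where S denotes balancing on Y."""
    z = (z_cont > threshold).astype(int)
    before = z.mean()
    # Balancing on Y gives each class weight 1/2.
    after = 0.5 * z[y == 1].mean() + 0.5 * z[y == 0].mean()
    return before, after

print(f"P(Y=1) = {y.mean():.3f}")
for t in (-0.10, -0.05, 0.0, 0.05, 0.10):
    before, after = z_marginal_before_after(t)
    print(f"threshold={t:+.2f}  P(Z=1)={before:.3f}  P(Z=1|S)={after:.3f}")
```

Computing the balanced marginal as an equal-weight average of the class-conditional means avoids resampling noise; a resampled dataset would give the same quantity in expectation.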

A.2 Failure mode: entangled signals

In the case where $X$ includes non-trivial intersection information $X_{Y\wedge Z}$ , data balancing will in general be insufficient to ensure that there is no association bias. This is because a risk-minimizing predictor $f(X)$ will condition on $X_{Y\wedge Z}$ , and the distribution of these intersection features is influenced by $Z$ .

Specifically, we will give a case where $Y$ is marginally independent of $Z$ and there is no uncontrolled confounding, but $E[f(X)\mid Z=0]\neq E[f(X)\mid Z=1]$ .

Suppose we have the following data generating process (DGP):

\begin{aligned}P(Y=1)&=0.5\\ P(Z=1)&=0.5\\ P(Y=1,Z=1)&=P(Y=1)P(Z=1)\textrm{, i.e., }Y\mathrel{\perp\mspace{-10.0mu}\perp}Z\\ P(X=1\mid Y,Z)&=\left\{\begin{array}[]{rl}p&\textrm{if $Y=1$ OR $Z=1$}\\ q&\textrm{o.w.}\end{array}\right.\end{aligned}

Note that in this case the entirety of $X$ would be classified as intersection information $X_{Y\wedge Z}$ .

In this setup, the Bayes-optimal probabilities for classification, $f(X)$ , are given by:

\displaystyle f(1):=P(Y=1\mid X=1)=\frac{P(X=1\mid Y=1)P(Y=1)}{P(X=1)}=\frac{p\cdot 0.5}{0.75p+0.25q}

and

\displaystyle f(0):=P(Y=1\mid X=0)=\frac{(1-P(X=1\mid Y=1))P(Y=1)}{P(X=0)}=\frac{(1-p)\cdot 0.5}{1-(0.75p+0.25q)}.

Note that when we condition on $Z\in\{0,1\}$, the expectation of $f(X)$ is different whenever (1) $p\neq q$, i.e., whenever the distribution of $X$ actually depends on the function of $Y$ and $Z$, and (2) $f(1)\neq f(0)$, i.e., there is some information in $X$ to predict $Y$:

\begin{aligned}E[f(X)\mid Z=1]&=E[E[f(X)\mid X,Z=1]]=pf(1)+(1-p)f(0),\\ E[f(X)\mid Z=0]&=E[E[f(X)\mid X,Z=0]]\\ &=(0.5p+0.5q)f(1)+(0.5(1-p)+0.5(1-q))f(0).\end{aligned}

In the simple case where $p=1$ and $q=0$ (i.e., $X=Y\textrm{ OR }Z$ deterministically), we get

f(X):=P(Y=1\mid X)=\left\{\begin{array}[]{rl}2/3&\textrm{if }X=1\\ 0&\textrm{if }X=0.\end{array}\right.

E[f(X)\mid Z]=\left\{\begin{array}[]{rl}2/3&\textrm{if }Z=1\\ 1/3&\textrm{if }Z=0.\end{array}\right.
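The quantities of this example can be evaluated for any $p$ and $q$ with a few lines of exact arithmetic. This is a minimal sketch under the DGP stated above; the function name `entangled_example` is hypothetical, and degenerate settings with $P(X=1)\in\{0,1\}$ are not handled.

```python
from fractions import Fraction as F

def entangled_example(p, q):
    """Bayes-optimal f and its group-wise means for X ~ Bernoulli(p if Y or Z else q).

    Assumes P(Y=1) = P(Z=1) = 1/2 with Y independent of Z, as in the text.
    Returns (f(1), f(0), E[f(X) | Z=1], E[f(X) | Z=0]).
    """
    half = F(1, 2)
    # P(Y=1 or Z=1) = 3/4, so P(X=1) = 0.75 p + 0.25 q.
    p_x1 = F(3, 4) * p + F(1, 4) * q
    # P(X=1 | Y=1) = p because Y=1 already satisfies the OR.
    f1 = p * half / p_x1
    f0 = (1 - p) * half / (1 - p_x1)
    # Given Z=1 the OR is true, so X=1 with probability p.
    e_z1 = p * f1 + (1 - p) * f0
    # Given Z=0 the OR is true only when Y=1 (probability 1/2).
    p_x1_z0 = half * p + half * q
    e_z0 = p_x1_z0 * f1 + (1 - p_x1_z0) * f0
    return f1, f0, e_z1, e_z0

f1, f0, e_z1, e_z0 = entangled_example(F(1), F(0))
print("f(1) =", f1, " f(0) =", f0)
print("E[f(X)|Z=1] =", e_z1, " E[f(X)|Z=0] =", e_z0)
```

Calling the function with other values of `p` and `q` shows how the gap between the two group-wise means shrinks as $p$ approaches $q$.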

Appendix B Conditions for data balancing to lead to an invariant and optimal model

We first investigate the case of a risk-invariant model w.r.t $\mathcal{P}$ , and then discuss fairness criteria.

B.1 Risk-invariant, optimal model

In this section we provide proofs for Section 4.

Recall that $\mathcal{P}=\{P^{\prime}(X,Y,Z)=P(X^{\perp}_{Z}|Y,Z)P(X^{\perp}_{Y}|Y,Z)P(X_{Z\wedge Y}|Y,Z)P^{\prime}(Z|Y)P(Y)\}$ and that we assume a data balancing distribution $Q(X,Y,Z)\in\mathcal{P}$ of the form $Q(X,Y,Z)=P(X\,|\,Y,Z)P(Z)P(Y)$. Also recall that we define $X^{\perp}_{Z}$ to be a sufficient statistic for $Y$ in $Q$ if $\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X^{\perp}_{Z}]$.

Proposition 4.2.

Proof.

Let $P^{\prime}$ be an arbitrary distribution in $\mathcal{P}$ . We have

\begin{aligned}P^{\prime}(X^{\perp}_{Z}\,|\,Y)&=\sum_{Z}P^{\prime}(X^{\perp}_{Z}\,|\,Y,Z)P^{\prime}(Z\,|\,Y)\\ &\stackrel{{\scriptstyle(1)}}{{=}}\sum_{Z}Q(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P^{\prime}(Z\,|\,Y)\\ &=Q(X^{\perp}_{Z}\,|\,Y),\end{aligned}

where (1) holds as $P^{\prime},Q\in{\mathcal{P}}$ and by the independence assumption. As $P^{\prime}(Y)=Q(Y)$ we obtain $P^{\prime}(X^{\perp}_{Z},Y)=Q(X^{\perp}_{Z},Y)$ . As $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$ , $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]=\operatorname{\mathbb{E}}_{Q}[Y\,% |\,X^{\perp}_{Z}]$ , that is $f(X)$ (and therefore the loss $\ell(f;X,Y)$ ) remains constant for different values of $X^{\perp}_{Y},X_{Y\wedge Z}$ , giving

\displaystyle\operatorname{\mathbb{E}}_{X,Y\sim P^{\prime}}[\ell(f;X,Y)]=% \operatorname{\mathbb{E}}_{X^{\perp}_{Z},Y\sim P^{\prime}}[\ell(f;X,Y)]=% \operatorname{\mathbb{E}}_{X^{\perp}_{Z},Y\sim Q}[\ell(f;X,Y)].

The same reasoning can be repeated for $P^{\prime\prime}\in\mathcal{P}$ , obtaining $\operatorname{\mathbb{E}}_{X,Y\sim P^{\prime}}[\ell(f;X,Y)]=\operatorname{% \mathbb{E}}_{X,Y\sim P^{\prime\prime}}[\ell(f;X,Y)]$ , which proves that $f$ is risk-invariant w.r.t. $\mathcal{P}$ .
As $f=\arg\min_{f^{\prime}}\operatorname{\mathbb{E}}_{X,Y\sim Q}[\ell(f^{\prime};X,Y)]$ and $\operatorname{\mathbb{E}}_{X,Y\sim P^{\prime}}[\ell(f;X,Y)]=\operatorname{\mathbb{E}}_{X,Y\sim Q}[\ell(f;X,Y)]$ $\forall P^{\prime}\in\mathcal{P}$, we obtain $f=\arg\min_{f^{\prime}}\operatorname{\mathbb{E}}_{X,Y\sim P^{\prime}}[\ell(f^{\prime};X,Y)]$, $\forall P^{\prime}\in\mathcal{P}$, which implies that $f$ is optimal w.r.t. $\mathcal{P}$. ∎

Corollary 4.3.

Let $R=\{X^{\perp}_{Y},X_{Y\wedge Z}\}$ . If $R\mathrel{\perp\mspace{-10.0mu}\perp}_{P}\{X^{\perp}_{Z},Y\}\,|\,Z$ and $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ , then $f(X^{\perp}_{Z})=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X^{\perp}_{Z}]$ is risk-invariant and optimal w.r.t. $\mathcal{P}$ .

Proof.

We have

\begin{aligned}Q(Y\,|\,R,X^{\perp}_{Z})&=\frac{\sum_{Z}Q(R,X^{\perp}_{Z},Y,Z)}{\sum_{Z,Y}Q(R,X^{\perp}_{Z},Y,Z)}\\ &\stackrel{{\scriptstyle(1)}}{{=}}\frac{\sum_{Z}P(R,X^{\perp}_{Z}\,|\,Y,Z)P(Z)P(Y)}{\sum_{Z,Y}P(R,X^{\perp}_{Z}\,|\,Y,Z)P(Z)P(Y)}\\ &\stackrel{{\scriptstyle(2)}}{{=}}\frac{\sum_{Z}P(R\,|\,\cancel{X^{\perp}_{Z},Y},Z)P(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P(Z)P(Y)}{\sum_{Z,Y}P(R\,|\,\cancel{X^{\perp}_{Z},Y},Z)P(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P(Z)P(Y)}\\ &=\frac{P(R)P(X^{\perp}_{Z}\,|\,Y)P(Y)}{P(R)\sum_{Y}P(X^{\perp}_{Z}\,|\,Y)P(Y)}\\ &=P(Y\,|\,X^{\perp}_{Z}),\end{aligned}

where (1) holds by the definition of the balanced distribution $Q$ and (2) holds by the independence assumptions. This derivation shows that $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}R\,|\,X^{\perp}_{Z}$ and therefore that $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$. We are in the same conditions as in Proposition 4.2, which implies that $f$ is risk-invariant and optimal w.r.t. $\mathcal{P}$. ∎

Proposition 4.4.

Let $\phi(\cdot)$ be disentangled with $\mathbb{E}_{P^{\prime}}[Y\,|\,\phi^{\perp}_{Z}(X)]=\mathbb{E}_{P^{\prime}}[Y\,|\,X^{\perp}_{Z}]\ \forall P^{\prime}\in\mathcal{P}$ and $h$ be a linear function. The risk-minimizer $f(X):=\operatorname{\mathbb{E}}_{Q}[Y\,|\,X]$ is optimal and risk-invariant across $\mathcal{P}$ if $\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,f(X)]=\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(\phi^{\perp}_{Z}(X))]\ \forall P^{\prime}\in\mathcal{P}$, $X^{\perp}_{Z}$ is a sufficient statistic for $Y$ in $Q$ and $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$.

Proof.

The proof is straightforward as it directly depends on the definition of a disentangled representation and the previous statements:

\begin{aligned}\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,f(X)]&=\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(\phi^{\perp}_{Z}(X))]\\ &\stackrel{{\scriptstyle(1)}}{{=}}\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(X^{\perp}_{Z})]\\ &\stackrel{{\scriptstyle(2)}}{{=}}\operatorname{\mathbb{E}}_{P}[Y\,|\,h(X^{\perp}_{Z})],\end{aligned}

where (1) reflects the assumption of a disentangled representation, and (2) uses the proof of Proposition 4.2. ∎

B.2 Conditions for data balancing to lead to a fair model

This section gives several results to illustrate the fact that data balancing implemented to generate independence between outcomes $Y$ and sensitive attributes $Z$ does not necessarily imply that a function of some covariates $X$ to predict $Y$ will be independent of (or not encode information on) $Z$ . The results we describe do not address the case where $X^{\perp}_{Z}$ is not accessible directly.

Proposition B.1 (Demographic parity).

$X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ if $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ ; that is balancing successfully induces independence between $X^{\perp}_{Z}$ and $Z$ if $X^{\perp}_{Z}$ and $Z$ are independent given $Y$ in the original data distribution.

Proof.

Let $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ . The following derivation demonstrates the claim,

\begin{aligned}Q(X^{\perp}_{Z}\,|\,Z)&=\frac{\sum_{Y}Q(X^{\perp}_{Z},Y,Z)}{\sum_{Y,X^{\perp}_{Z}}Q(X^{\perp}_{Z},Y,Z)}\\ &\stackrel{{\scriptstyle(1)}}{{=}}\frac{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}{\sum_{Y,X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}\\ &\stackrel{{\scriptstyle(2)}}{{=}}\frac{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}{\sum_{Y,X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}\\ &=P(X^{\perp}_{Z}),\end{aligned}

where $(1)$ holds by the definition of data balancing on the joint, $(2)$ holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of $Z$ which establishes marginal independence. ∎

Proposition B.2.

In general, $X^{\perp}_{Z}$ and $Z$ are not independent in $Q$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in $P$ ; that is data balancing does not induce independence between $X^{\perp}_{Z}$ and $Z$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in the original data distribution.

Proof.

Note first that the reduction in $(2)$ does not hold in general without conditional independence. Further, note that,

\displaystyle Q(X^{\perp}_{Z}\,|\,Z)=\sum_{Y}Q(X^{\perp}_{Z}\,|\,Z,Y)Q(Y\,|\,Z% )=\sum_{Y}Q(X^{\perp}_{Z}\,|\,Z,Y)Q(Y).

If $X^{\perp}_{Z}$ and $Z$ are dependent given $Y$ in $P$ then $X^{\perp}_{Z}$ and $Z$ are dependent given $Y$ in $Q$ so that $Q(X^{\perp}_{Z}\,|\,Z,Y)$ varies with $Z$ , making the l.h.s a function of $Z$ in general. Therefore, in general, data balancing will not be successful without conditional independence. ∎

Proposition B.3 (Predictive parity).

$Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,X^{\perp}_{Z}$ if $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ ; that is data balancing successfully induces independence between $Y$ and $Z$ given $X^{\perp}_{Z}$ if $X^{\perp}_{Z}$ and $Z$ are independent given $Y$ in the original data distribution.

Proof.

Let $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ . The following derivation demonstrates the claim,

\begin{aligned}Q(Y\,|\,X^{\perp}_{Z},Z)&=\frac{Q(X^{\perp}_{Z},Y,Z)}{\sum_{Y}Q(X^{\perp}_{Z},Y,Z)}\\ &\stackrel{{\scriptstyle(1)}}{{=}}\frac{P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}\\ &\stackrel{{\scriptstyle(2)}}{{=}}\frac{P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}\\ &=P(Y\,|\,X^{\perp}_{Z}),\end{aligned}

where $(1)$ holds by the definition of data balancing on the joint, $(2)$ holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of $z$ which establishes conditional independence. ∎

Proposition B.4.

In general, $Y$ and $Z$ are not independent given $X^{\perp}_{Z}$ in $Q$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in $P$ ; that is data balancing does not induce independence between $Y$ and $Z$ given $X^{\perp}_{Z}$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in the original data distribution.

Proof.

Similarly to the arguments above, the reduction in $(2)$ does not hold in general without conditional independence. Therefore, in general, data balancing will not be successful without conditional independence. ∎

Proposition B.5 (Equalized odds).

$(X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y)_{Q}$ if $(X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y)_{P}$ ; that is data balancing does not disturb independence between $X^{\perp}_{Z}$ and $Z$ given $Y$ if $X^{\perp}_{Z}$ and $Z$ are independent given $Y$ in the original data distribution.

Proof.

Let $(X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y)_{P}$ . Note that in this case we just need to show that data balancing does not disturb the conditional independence $(X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y)_{P}$ present in the original data (we already had equalized odds in original data). The following derivation demonstrates the claim,

\begin{aligned}Q(X^{\perp}_{Z}\,|\,Z,Y)&=\frac{Q(X^{\perp}_{Z},Y,Z)}{\sum_{X^{\perp}_{Z}}Q(X^{\perp}_{Z},Y,Z)}\\ &\stackrel{{\scriptstyle(1)}}{{=}}\frac{P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}{\sum_{X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}\\ &\stackrel{{\scriptstyle(2)}}{{=}}\frac{P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}{\sum_{X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}\\ &=P(X^{\perp}_{Z}\,|\,Y),\end{aligned}

where $(1)$ holds by the definition of data balancing, $(2)$ holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of $z$ which establishes conditional independence. ∎
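Propositions B.1, B.3 and B.5 can be checked numerically on a small discrete example. The sketch below is an illustration under assumptions that are not in the text: it draws an arbitrary dependent joint $P(Y,Z)$ and an arbitrary $P(X^{\perp}_{Z}\,|\,Y)$ (so that $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ holds by construction), builds the jointly balanced $Q$, and reports how far each conditional is from the value given in the corresponding proof. The array names are hypothetical and `x` stands for $X^{\perp}_{Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 3, 2, 2

# Arbitrary joint P(Y, Z) in which Y and Z are dependent.
p_yz = rng.random((ny, nz))
p_yz /= p_yz.sum()
p_y, p_z = p_yz.sum(axis=1), p_yz.sum(axis=0)

# Arbitrary P(x | Y); by construction x is independent of Z given Y in P.
p_x_given_y = rng.random((nx, ny))
p_x_given_y /= p_x_given_y.sum(axis=0, keepdims=True)

# Jointly balanced distribution Q[x, y, z] = P(x | y, z) P(y) P(z).
q = p_x_given_y[:, :, None] * p_y[None, :, None] * p_z[None, None, :]

# B.1 (demographic parity): Q(x | z) should equal P(x) for every z.
p_x = p_x_given_y @ p_y
q_x_given_z = q.sum(axis=1) / q.sum(axis=(0, 1))
dev_b1 = np.abs(q_x_given_z - p_x[:, None]).max()

# B.3 (predictive parity): Q(y | x, z) should not vary with z.
q_y_given_xz = q / q.sum(axis=1, keepdims=True)
dev_b3 = np.abs(q_y_given_xz[..., 0] - q_y_given_xz[..., 1]).max()

# B.5 (equalized odds): Q(x | y, z) should equal P(x | y).
q_x_given_yz = q / q.sum(axis=0, keepdims=True)
dev_b5 = np.abs(q_x_given_yz - p_x_given_y[:, :, None]).max()

print("max deviation B.1:", dev_b1)
print("max deviation B.3:", dev_b3)
print("max deviation B.5:", dev_b5)
```

Replacing `p_x_given_y[:, :, None]` by a conditional that also varies with $Z$ gives a quick way to see the negative results of Propositions B.2, B.4 and B.6 on a concrete distribution.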

Proposition B.6.

In general, $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in $Q$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in $P$ ; that is data balancing does not induce independence between $X^{\perp}_{Z}$ and $Z$ if $X^{\perp}_{Z}$ and $Z$ are not independent given $Y$ in the original data distribution.

Proof.

Appendix C Impact of data balancing on the CBN

In the following we assume that $Z$ is discrete, but all the results remain valid for continuous $Z$ .

Proposition 5.1.

Let $\langle\mathcal{G},P\rangle$ be the CBN underlying the data, where $\mathcal{G}$ contains an undesired path between $Z$ and $Y$ , and let $\mathcal{G}^{0}$ be a modification of $\mathcal{G}$ in which the undesired path has been removed. The distribution $Q$ obtained by joint balancing the data to make $Z$ and $Y$ statistically independent, i.e. $Q(Y,X,Z)=P(X\,|\,Y,Z)P(Z)P(Y)$ , might not factorize according to $\mathcal{G}^{0}$ .

Proof.

Example 1: Causal task with causal and non-causal paths. Consider $\mathcal{G}=\{Z\rightarrow X\rightarrow Y,Z\leftarrow U\rightarrow Y\}$ , for unobserved $U$ . We have

\displaystyle Q(Y\,|\,X,Z)

\displaystyle=\frac{Q(X,Y,Z)}{\sum_{Y}Q(X,Y,Z)}=\frac{P(X\,|\,Y,Z)P(Z)P(Y)}{% \sum_{Y}P(X\,|\,Y,Z)P(Z)P(Y)}=\frac{P(X\,|\,Z,Y)P(Y)}{\sum_{Y}P(X\,|\,Z,Y)P(Y)},

where the r.h.s is a function of $Z$ in general as $X$ is not independent of $Y$ given $Z$ in $P$. If $Q$ factorized according to $\mathcal{G}^{0}=\{Z\rightarrow X\rightarrow Y\}$, then $Y\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,X$ would hold in $Q$. To show the claim it suffices therefore to construct a distribution $P$ such that $X$ is not independent of $Y$ given $Z$.

Example 2: Causal task with non-causal path. Consider $\mathcal{G}=\{X\rightarrow Y,Z\leftarrow U\rightarrow Y\}$ . We have that,

\displaystyle Q(X\,|\,Z)=\frac{\sum_{Y}Q(X,Y,Z)}{\sum_{Y,X}Q(X,Y,Z)}=\frac{\sum_{Y}P(X\,|\,Y,Z)P(Z)P(Y)}{\sum_{Y,X}P(X\,|\,Y,Z)P(Z)P(Y)}=\sum_{Y}P(X\,|\,Y,Z)P(Y).

The r.h.s is a function of $Z$ in general as $X$ is not independent of $Z$ given $Y$ in a distribution $P$ consistent with $\mathcal{G}$ . Therefore, one may not interpret the mutilated graph $\mathcal{G}^{0}=\{X\rightarrow Y\}$ as a correct representation of the conditional independencies implied by the balanced distribution $Q$ .

Example 3: Causal task with causal path. Consider $\mathcal{G}=\{Z\rightarrow X\rightarrow Y\}$ . We have that,

\displaystyle Q(Y\,|\,X,Z)

\displaystyle=\frac{Q(X,Y,Z)}{\sum_{Y}Q(X,Y,Z)}=\frac{P(X\,|\,Y,Z)P(Z)P(Y)}{% \sum_{Y}P(X\,|\,Y,Z)P(Z)P(Y)}=\frac{P(X\,|\,Z,Y)P(Y)}{\sum_{Y}P(X\,|\,Z,Y)P(Y)},

The r.h.s is a function of $Z$ in general as $X$ is not independent of $Z$ given $Y$ in $P$ . Therefore, one may not interpret the mutilated graph $\mathcal{G}^{0}=\{Z,X\rightarrow Y\}$ as a correct representation of the conditional independencies implied by the balanced distribution $Q$ .

Example 4: Anti-causal task. Consider $\mathcal{G}=\{Y\rightarrow X,Z\leftarrow U\rightarrow Y,Z\rightarrow W% \rightarrow X\}$ . We have that,

\displaystyle Q(X\,|\,Z)=\frac{\sum_{Y,W}Q(X,Y,Z,W)}{\sum_{Y,X,W}Q(X,Y,Z,W)}=\frac{\sum_{Y,W}P(X,W\,|\,Y,Z)P(Z)P(Y)}{\sum_{Y,X,W}P(X,W\,|\,Y,Z)P(Z)P(Y)}=\sum_{Y}P(X\,|\,Y,Z)P(Y).

The r.h.s is a function of $Z$ in general as $X$ is not independent of $Z$ given $Y$ in a distribution $P$ consistent with $\mathcal{G}$ . Therefore, one may not interpret the mutilated graph $\mathcal{G}^{\prime}=\{Y\rightarrow X,Z\rightarrow W\rightarrow X\}$ as a correct representation of the conditional independencies implied by the balanced distribution $Q$ .

∎
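Example 2 can be made concrete with an exact computation on binary variables. The sketch below is an illustration only: the conditional probability tables are hypothetical values chosen to be consistent with $\mathcal{G}=\{X\rightarrow Y,Z\leftarrow U\rightarrow Y\}$, and the array names are not from the paper. It compares $P(X=1\,|\,Z)$ in the original distribution with $Q(X=1\,|\,Z)$ after joint balancing.

```python
import numpy as np

# Hypothetical CPTs consistent with G = {X -> Y, Z <- U -> Y}.
p_x = np.array([0.5, 0.5])                            # P(X)
p_u = np.array([0.5, 0.5])                            # P(U)
p_z_given_u = np.array([[0.9, 0.1], [0.1, 0.9]])      # indexed [u, z]
p_y1_given_xu = np.array([[0.1, 0.5], [0.5, 0.9]])    # P(Y=1 | x, u), indexed [x, u]
p_y_given_xu = np.stack([1 - p_y1_given_xu, p_y1_given_xu], axis=-1)  # [x, u, y]

# Observed joint P(X, Y, Z), marginalizing the unobserved U.
p_xyz = np.einsum("x,u,uz,xuy->xyz", p_x, p_u, p_z_given_u, p_y_given_xu)
p_y = p_xyz.sum(axis=(0, 2))
p_z = p_xyz.sum(axis=(0, 1))

# Jointly balanced distribution Q(X, Y, Z) = P(X | Y, Z) P(Y) P(Z).
p_x_given_yz = p_xyz / p_xyz.sum(axis=0, keepdims=True)
q_xyz = p_x_given_yz * p_y[None, :, None] * p_z[None, None, :]

# Probability of X=1 given each value of Z, before and after balancing.
p_x1_given_z = p_xyz.sum(axis=1)[1] / p_z
q_x1_given_z = q_xyz.sum(axis=1)[1] / q_xyz.sum(axis=(0, 1))

print("P(X=1 | Z=0), P(X=1 | Z=1):", p_x1_given_z)
print("Q(X=1 | Z=0), Q(X=1 | Z=1):", q_x1_given_z)
```

In this construction $X$ and $Z$ are marginally independent in $P$ because $Y$ is a collider between them, so any variation of the second printed vector across $Z$ is introduced by balancing.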

C.1 Regularization and data balancing don’t always go hand in hand

C.1.1 Risk-invariance

We first consider the graph in Figure 1(d) and show that $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ in both $P$ and $Q$, which justifies the use of regularization in addition to data balancing, although there might not be a benefit of using both techniques simultaneously (in theory).

Proposition C.1.

Consider the graph $\mathcal{G}$ in Figure 1(d). Then $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ in both the training data distribution $P$ (consistent with $\mathcal{G}$ ) and the distribution after balancing, namely $Q$ .

Proof.

$X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ holds in the training data distribution $P$ by $d$ -separation. For the conditional independence in $Q$ , consider the following derivation,

\begin{aligned}Q(X^{\perp}_{Z}\,|\,Y,Z)&=\frac{\sum_{X_{Y\wedge Z}}P(X^{\perp}_{Z},X_{Y\wedge Z}\,|\,Z,Y)P(Z)P(Y)}{\sum_{X_{Y\wedge Z},X^{\perp}_{Z}}P(X^{\perp}_{Z},X_{Y\wedge Z}\,|\,Z,Y)P(Z)P(Y)}\\ &=P(X^{\perp}_{Z}\,|\,Z,Y)=P(X^{\perp}_{Z}\,|\,Y)=g(X^{\perp}_{Z}\,|\,Y).\end{aligned}

The r.h.s is not a function of $Z$ and therefore $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ holds in $Q$ . ∎

However, when considering the graph in Figure 1(b), we introduce a dependence between $X^{\perp}_{Z}$ and $Z$, which can be easily checked with the simulation in Figure 8, in which we consider the simplified graph $Z\rightarrow Y\leftarrow X$. While we are able to obtain the marginal independence between $Y$ and $Z$ ($\chi^{2}:p=0.34$), we introduce a dependence between $X$ and $Z$ ($\chi^{2}:p<0.0001$).

import numpy as np
import scipy.stats

# Number of samples.
n = 10000

# Generate binary data for the simplified graph Z -> Y <- X
# (in this code, Z and Y are linked through the shared variable u).
x = 1*(np.random.normal(size=n) > 0)
u = 1*(np.random.normal(size=n) > 0.3)
y = 1*(x - u + 0.5*np.random.normal(size=n) > 0.5)
z = 1*(u - 0.5*np.random.normal(size=n) > 0.1)

# Per-sample marginal probability of the observed value of z.
p_z = np.array([np.mean(z==i) for i in z])
# Per-sample marginal probability of the observed value of y.
p_y = np.array([np.mean(y==i) for i in y])
# Per-sample joint probability of the observed (z, y) pair.
p_zy = np.array([np.mean((z==i)&(y==j)) for i, j in zip(z,y)])

# Resampling probabilities: weight each sample by P(z)P(y)/P(z,y), then normalize.
indep_probs = p_z * p_y / p_zy
indep_probs /= np.sum(indep_probs)

# Re-sample according to computed probabilities.
indices = np.random.choice(n, size=n, replace=True, p=indep_probs)
z_bal, x_bal, y_bal = z[indices], x[indices], y[indices]

# Check that Y and Z are independent.
# Create contingency table (crosstab returns the unique elements and the counts).
_, contingency_table_bal_zy = scipy.stats.contingency.crosstab(z_bal, y_bal)
# Chi-squared test of independence.
statistic_zy, pvalue_zy, _, _ = scipy.stats.chi2_contingency(contingency_table_bal_zy)

# Check whether X and Z are independent.
_, contingency_table_bal_xz = scipy.stats.contingency.crosstab(z_bal, x_bal)
statistic_xz, pvalue_xz, _, _ = scipy.stats.chi2_contingency(contingency_table_bal_xz)

print("chi2 p-value for Y vs Z after balancing:", pvalue_zy)
print("chi2 p-value for X vs Z after balancing:", pvalue_xz)

Figure 8: Python code to assess the impact of balancing in a numerical simulation of graph Figure 1(b).

C.1.2 When does data-balancing together with regularization lead to fair models?

This section gives several results to analyze the combination of data balancing implemented to generate independence between outcomes $Y$ and sensitive attributes $Z$ and regularization in two variants. First, regularizing to learn representations $W=\phi(X^{\perp}_{Z})$ such that $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ ; and second regularizing to learn representations $W=\phi(X^{\perp}_{Z})$ such that $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ . We write $X^{\perp}_{Z}\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Y$ to state that $X^{\perp}_{Z}$ and $Y$ are independent in distribution $P$ .

Regularization such that $\phi(X^{\perp}_{Z})\mathrel{\perp\mspace{-10.0mu}\perp}Z\,|\,Y$ .

Proposition C.2 (Demographic parity).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ is sufficient for demographic parity, i.e. $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ .

Proof.

\displaystyle Q(W\,|\,Z)=\sum_{Y}Q(W\,|\,Z,Y)Q(Y\,|\,Z)\stackrel{{\scriptstyle% (1)}}{{=}}\sum_{Y}Q(W\,|\,Y)Q(Y)=Q(W),

where (1) holds by the assumption of balancing in which $Z\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Y$ and regularization $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ . ∎

Proposition C.3 (Predictive parity).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ is sufficient for predictive parity, i.e. $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,W$ .

Proof.

\displaystyle Q(Z\,|\,Y,W)

\displaystyle=Q(Z\,|\,Y)=Q(Z),

where both equalities hold by the assumption of balancing in which $Z\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Y$ and regularization $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ . ∎

Proposition C.4 (Equalized odds).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ is sufficient for equalized odds, i.e. $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ .

Proof.

Regularization induces $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ and so equalized odds is satisfied by design. ∎

Remark: Note that balancing and regularization together are not always necessary, for example the section above shows that balancing on its own can be successful in some cases.

Regularization such that $\phi(X^{\perp}_{Z})\mathrel{\perp\mspace{-10.0mu}\perp}Z$ .

Proposition C.5 (Demographic parity).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ is sufficient for demographic parity, i.e. $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ .

Proof.

Regularization induces $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ and so demographic parity is satisfied by design. ∎

Proposition C.6 (Predictive parity).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ is not sufficient for predictive parity, i.e. $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,W$ does not hold.

Proof.

We give a counter-example. Let $A,B,C$ be three independent variables, each uniformly distributed on $\{0,1\}$ . Let $X^{\perp}_{Z}=\mathbf{1}\{A=B\},Y=\mathbf{1}\{A=C\},Z=\mathbf{1}\{B=C\}$ , and take $W=X^{\perp}_{Z}$ . Let $Q$ be the induced probability distribution over $(X^{\perp}_{Z},Y,Z)$ . These three variables are pairwise independent, so $Q$ could have been generated by balancing and regularization, since $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ and $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ . However, conditioned on $X^{\perp}_{Z}$ , $Y$ and $Z$ determine each other and so predictive parity does not hold in $Q$ . ∎
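The counter-example can be checked by exhaustive enumeration. The sketch below is illustrative only: it assumes $A,B,C$ are uniform on $\{0,1\}$ and that $\phi$ is the identity, so that $W=X^{\perp}_{Z}$ ; the helper names `prob` and `independent` are not from the paper.

```python
from fractions import Fraction
from itertools import product

# Assumption: A, B, C are independent and uniform on {0, 1};
# phi is the identity, so W = X_perp_Z.
joint = {}
for a, b, c in product((0, 1), repeat=3):
    w, y, z = int(a == b), int(a == c), int(b == c)
    joint[(w, y, z)] = joint.get((w, y, z), Fraction(0)) + Fraction(1, 8)

INDEX = {"w": 0, "y": 1, "z": 2}

def prob(**fixed):
    """Probability that the named coordinates take the given values."""
    return sum(p for key, p in joint.items()
               if all(key[INDEX[name]] == value for name, value in fixed.items()))

def independent(first, second):
    """Check marginal independence of two binary coordinates exactly."""
    return all(prob(**{first: u, second: v}) == prob(**{first: u}) * prob(**{second: v})
               for u, v in product((0, 1), repeat=2))

w_indep_z = independent("w", "z")   # regularization constraint
y_indep_z = independent("y", "z")   # balancing constraint

# Q(Y=1 | Z=z, W=1): depends on z, so predictive parity fails.
p_y1_given_w1 = {z: prob(w=1, y=1, z=z) / prob(w=1, z=z) for z in (0, 1)}
# Q(W=1 | Z=z, Y=1): depends on z, so equalized odds fails.
p_w1_given_y1 = {z: prob(w=1, y=1, z=z) / prob(y=1, z=z) for z in (0, 1)}

print(w_indep_z, y_indep_z, p_y1_given_w1, p_w1_given_y1)
```

Both marginal independences hold, yet the two conditional probabilities jump from 0 to 1 as $Z$ changes, which is the dependence the proof describes.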

Proposition C.7 (Equalized odds).

Balancing and regularization such that $W=\phi(X^{\perp}_{Z})$ and $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z$ is not sufficient for equalized odds, i.e. $W\mathrel{\perp\mspace{-10.0mu}\perp}_{Q}Z\,|\,Y$ need not hold.

Proof.

The counter-example above applies: conditioned on $Y$ , $W$ and $Z$ determine each other. ∎

Appendix D Experiments

D.1 Datasets

This work uses the MNIST [44, 17, http://yann.lecun.com/exdb/mnist/], Amazon reviews [52], ImageNet [16, https://image-net.org/] and CelebA [45, http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html] datasets, which are all openly accessible and can be used for research purposes.

MNIST semi-synthetic data: For simplicity, we binarize the digit recognition task to a label $Y\in\{0,1\}$ according to whether the number in the image is $<5$ or $\geq 5$ , such that $Y$ matches the ground truth with probability $0.98$ . The top of the image is replaced by noise coloured in red for $Z=0$ and blue for $Z=1$ (see Figure 2). We relate the confounder and the label such that $95\%$ (resp. $5\%$ ) of images with $Y=0$ have a red (resp. blue) noise pattern, while $10\%$ (resp. $90\%$ ) of the images with $Y=1$ have a red (resp. blue) pattern, corresponding to our original distribution $P$ . In this distribution, the marginal distributions of $Y$ and $Z$ are (close to) uniform. We draw $n=30,000$ samples from $P$ , as well as a dataset jointly balanced on $Y$ and $Z$ ( $Q$ , $n=30,000$ ). We also sample test data based on a ground truth $P^{0}$ generated with $P^{0}(Z=0|Y)=0.5$ ( $n=2,000$ ). Finally, we generate an $X^{\perp}_{Z}$ dataset that contains white instead of colored noise.
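As a rough illustration of the label-level part of this construction, the sketch below draws $(Y,Z)$ pairs with the conditional probabilities stated above and then balances them jointly. It is only a sketch under assumptions: the marginal of $Y$ is taken to be exactly uniform, balancing is done by subsampling each $(Y,Z)$ cell to the size of the smallest one (the text does not say how $Q$ was sampled, and the $Q$ dataset described above keeps $n=30,000$ ), and all function names and the seed are hypothetical.

```python
import random
from collections import Counter, defaultdict

# P(Z=0 | Y=y) from the text: red noise (Z=0) for 95% of Y=0 and 10% of Y=1.
P_Z0_GIVEN_Y = {0: 0.95, 1: 0.10}

def sample_p(n, rng):
    """Draw n (y, z) pairs from P, assuming a uniform marginal on Y."""
    pairs = []
    for _ in range(n):
        y = rng.randrange(2)
        z = 0 if rng.random() < P_Z0_GIVEN_Y[y] else 1
        pairs.append((y, z))
    return pairs

def jointly_balance(pairs, rng):
    """Subsample every (y, z) cell to the size of the smallest cell."""
    cells = defaultdict(list)
    for index, cell in enumerate(pairs):
        cells[cell].append(index)
    smallest = min(len(indices) for indices in cells.values())
    keep = [i for indices in cells.values() for i in rng.sample(indices, smallest)]
    return [pairs[i] for i in keep]

rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch only
p_pairs = sample_p(30_000, rng)
q_pairs = jointly_balance(p_pairs, rng)
p_counts, q_counts = Counter(p_pairs), Counter(q_pairs)
print("P cells:", dict(p_counts))
print("Q cells:", dict(q_counts))
```

After balancing, all four cells have the same count, so $Y$ and $Z$ are empirically independent in the balanced sample.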

MNIST semi-synthetic data with added confounder: We add $V$ and $X_{V}$ to our data generating process where $X_{V}$ is a green cross either on the left or right of the image, with a fixed vertical position. The horizontal position of the cross is given by $V$ and $V$ is correlated with $Y$ ( $P(V=0|Y=0)=0.2$ , $P(V=0|Y=1)=0.9$ ). We generate a confounded dataset (95/10) as previously, which we balance jointly on $Y$ and $Z$ . We then train 5 replicates of the same architecture, and test our model on $Q$ , as well as on the ground truth $P^{0}$ where $P(V=0|Y=0)=P(V=0|Y=1)=P(Z=0|Y=0)=P(Z=0|Y=1)=0.5$ .

MNIST semi-synthetic data, entangled: We define the color of the noise based on $\textsc{OR}(Y,Z)$ . We define $Q$ by generating samples with $P(Z=0|Y=0)=P(Z=0|Y=1)=0.5$ , while $P^{0}$ is represented by the disentangled test dataset described above.

Amazon reviews with confounder: Following Veitch et al. [73], we define a causal task based on Amazon reviews for the clothing category: predicting, from the review’s text, whether the review was found to be helpful (i.e. obtained ‘thumbs up’ votes) or not. We generate a random variable $U$ as the unobserved confounder, and define $Y$ as the binary helpfulness label, randomly flipping the label based on $U$ (association: p=0.4). This leads to reviews with $Y=0$ being more associated with $U=0$ . We define $Z$ as $Z=\lambda*U+(1-\lambda)*U_{2}$ , where $U_{2}$ is another random variable distributed uniformly and $\lambda$ is a parameter that controls the relationship between $U$ and $Z$ , and by transitivity, between $Z$ and $Y$ . In $P$ , $\lambda$ is selected to be 0.8, leading to a correlation of 0.35 between $Y$ and $Z$ . To create $X^{\perp}_{Y}$ , we add perturbations to the text based on the value of $Z$ that wouldn’t (in theory) affect $Y$ . We select the words {and, the, you, my, they} and add a suffix ‘xxxx’ (resp. ‘yyyy’) when $Z=0$ (resp. $Z=1$ ). Finally, $Y$ is imbalanced, with only $5\%$ of the dataset having $Y=1$ . We hence re-balance the classes before modelling. This operation is also performed by the joint balancing.

D.2 Metric definitions and operationalization

Our work focuses on statistical group fairness criteria [5]. These can be translated as independence criteria on the model’s predictions.

Definition D.1 (Demographic parity).

A predictor $f(X)$ is said to satisfy demographic parity w.r.t. sensitive attribute $Z$ and distribution $P$ if $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z$ .

Definition D.2 (Predictive parity).

A predictor $f(X)$ trained to predict an outcome $Y$ is said to satisfy predictive parity w.r.t. sensitive attribute $Z$ and distribution $P$ if $Y\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,f(X)$ .

Definition D.3 (Equalized odds).

A predictor $f(X)$ trained to predict an outcome $Y$ is said to satisfy equalized odds w.r.t. a sensitive attribute $Z$ and distribution $P$ if $f(X)\mathrel{\perp\mspace{-10.0mu}\perp}_{P}Z\,|\,Y$ .

In our experiments, we estimate equalized odds as in Alabdulmohsin & Lučić [1]. For this metric, the lower, the better.

	$\displaystyle EO$	$\displaystyle=0.5*\Big(\max_{z\in\mathcal{Z}}\,\mathbb{E}_{X}[f(X)\,|\,Z=z,Y=0]\,-\,\min_{z\in\mathcal{Z}}\,\mathbb{E}_{X}[f(X)\,|\,Z=z,Y=0]\Big)$
		$\displaystyle+0.5*\Big(\max_{z\in\mathcal{Z}}\,\mathbb{E}_{X}[f(X)\,|\,Z=z,Y=1]\,-\,\min_{z\in\mathcal{Z}}\,\mathbb{E}_{X}[f(X)\,|\,Z=z,Y=1]\Big).$
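The estimator above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of Alabdulmohsin & Lučić [1]: it reads the formula as the average over $Y\in\{0,1\}$ of the max-minus-min gap in mean prediction across groups, and the function name `equalized_odds_gap` and the toy data are hypothetical.

```python
def equalized_odds_gap(preds, labels, groups):
    """Average over y in {0, 1} of the max-min gap across z of E[f(X) | Z=z, Y=y].

    Assumes every label value has at least one sample; empty (z, y) cells are skipped.
    """
    total = 0.0
    for label in (0, 1):
        means = []
        for group in sorted(set(groups)):
            values = [p for p, y, z in zip(preds, labels, groups)
                      if y == label and z == group]
            if values:
                means.append(sum(values) / len(values))
        total += 0.5 * (max(means) - min(means))
    return total

# Toy data (hypothetical): predictions, labels and group membership.
preds = [1, 1, 0, 0, 1, 0, 1, 1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
eo = equalized_odds_gap(preds, labels, groups)
eo_perfect = equalized_odds_gap(labels, labels, groups)
print(eo, eo_perfect)  # 0.75 0.0
```

A predictor whose outputs do not depend on $Z$ given $Y$ scores 0, in line with lower values being better.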

In terms of robustness metrics, we evaluate a simplified version of risk-invariance by computing model performance on a test set sampled from $P$ , and contrasting this result with the model’s performance on a test set sampled from $P^{0}$ (when known), or from $Q$ . We also estimate worst-group performance [63] as:

WG=\min_{z^{\prime}\in\mathcal{Z}}\,\operatorname{\mathbb{E}}_{X,y}[\mathbbm{1}[f(X)=y]\,|\,z=z^{\prime}]

An invariant model that is optimal would hence display high performance on both $P$ and $P^{0}$ / $Q$ , as well as high worst-group accuracy.
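Worst-group performance as defined above is the minimum per-group accuracy. A minimal sketch, with a hypothetical function name and toy data, could look as follows; it groups by $Z$ only, as in the formula.

```python
def worst_group_accuracy(preds, labels, groups):
    """Minimum over groups z of the accuracy of hard predictions within the group."""
    accuracies = []
    for group in sorted(set(groups)):
        hits = [int(p == y) for p, y, z in zip(preds, labels, groups) if z == group]
        accuracies.append(sum(hits) / len(hits))
    return min(accuracies)

# Toy data (hypothetical): the model is accurate on group 1 but not on group 0.
preds = [1, 1, 0, 0, 1, 0, 1, 1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
wg = worst_group_accuracy(preds, labels, groups)
print(wg)  # 0.25
```

Here the overall accuracy is 0.625, while the worst-group accuracy of 0.25 exposes the gap between groups.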

Metrics like risk-invariance or equalized odds provide insights on the model’s outputs, but do not probe the model’s representation. As we are interested in large-scale models that might be further fine-tuned, it is important to understand whether the model’s representation is invariant on $\mathcal{P}$ . Defining a representation as $\phi(X)$ , we can write $f(X)=h(\phi(X))$ in which we assume the representation to be fixed (i.e. frozen model weights) and $h$ is a learnable function. In Zemel et al. [80], the authors define a fair representation w.r.t. a binary $Z$ as demographic parity on the representation:

\operatorname{\mathbb{E}}_{X\in X^{Z=z}}\phi(X)=\operatorname{\mathbb{E}}_{X\in X^{Z=z^{\prime}}}\phi(X),\forall z,z^{\prime}\in\mathcal{Z},

where $X^{Z=z}$ corresponds to the samples with $Z=z$ . This is equivalent to assessing the ‘encoding’ of $Z$ in $\phi(X)$ , by training a linear layer $h:\phi(X)\rightarrow Z$ [27, 8]. Chance level performance of $h(\phi(X))$ would then suggest that the representation is independent of $Z$ . In the present work, we estimate the encoding of $Z$ using $P^{0}$ or $Q$ such that assessing the encoding of $Z$ is equivalent to assessing the encoding of $Z|Y$ . Models that encode less of the auxiliary factor $Z$ have been shown to reach a more ‘global’ optimum compared to models that encode the signal more strongly [independently of whether invariant predictions are obtained 79].
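The probing procedure can be sketched with a logistic-regression probe on frozen features. Everything below is an assumption made for illustration: the probe is trained by plain gradient descent in NumPy, the train/test split is 50/50, and the two synthetic "representations" (one that ignores $Z$ , one whose first coordinate is shifted by $Z$ ) stand in for $\phi(X)$ .

```python
import numpy as np

def probe_accuracy(reps, z, steps=500, lr=0.5, seed=0):
    """Fit a linear probe h: phi(X) -> Z on half of the data, score it on the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(z))
    half = len(z) // 2
    train, test = order[:half], order[half:]
    # Standardize with training statistics only.
    mean, std = reps[train].mean(axis=0), reps[train].std(axis=0) + 1e-8
    x = (reps - mean) / std
    w, b = np.zeros(reps.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x[train] @ w + b)))
        grad = p - z[train]                 # gradient of the logistic loss w.r.t. logits
        w -= lr * (x[train].T @ grad) / half
        b -= lr * grad.mean()
    pred = (x[test] @ w + b) > 0
    return float((pred == z[test]).mean())

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)
invariant_reps = rng.normal(size=(2000, 8))   # carries no information about Z
encoding_reps = invariant_reps.copy()
encoding_reps[:, 0] += 3.0 * z                # first coordinate encodes Z

acc_invariant = probe_accuracy(invariant_reps, z)
acc_encoding = probe_accuracy(encoding_reps, z)
print(acc_invariant, acc_encoding)
```

Held-out probe accuracy near chance suggests that $Z$ is not linearly decodable from the representation, whereas accuracy well above chance indicates that $Z$ is encoded.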

D.3 Model architectures

We consider multiple architectures in this work, with an attempt to cover different model sizes and characteristics.

• Small convolutional network, similar in spirit to AlexNet [42]. It includes 5 convolution blocks with kernel sizes (4, 3, 2, 2, 2, 2) and output channels (3, 6, 9, 12, 12, 9), with max pooling after each convolution, as well as two dense layers with ReLU non-linearity before the output head.
• VGG network [67] with square kernels of size 3, output channels of dimensions (64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512) and strides (1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1).
• Vision Transformers [18] of different sizes: ViT-micro (17M parameters), ViT-Tiny (44M), ViT-S (174M) and ViT-B (690M), with the Tiny sizes and up taken from [72].
• For text data, we use the BERT architecture, as defined in TensorFlow Hub.

We use a stochastic gradient descent optimizer with Nesterov momentum of $0.9$ for all models.

D.3.1 Hyper-parameter searches

We include a hyper-parameter search over the learning rate (5 values in log-scale between $9\times 10^{-5}$ and $0.1$ ) coupled with a batch size search over sizes of 128, 256 and 512 examples. In terms of regularization, the small convolutional network includes dropout in the dense layers (search on 0.1, 0.2, 0.3), while VGG includes batch normalization in the dense layers (as per their original implementations). We impose an L2-regularization of $10^{-4}$ during training for all architectures.
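For concreteness, the search space described above can be enumerated as follows. This is a sketch under assumptions: the five learning rates are taken to be geometrically spaced with both endpoints included, the search is taken to be a full grid, and the dropout dimension applies to the small convolutional network only; the dictionary keys are hypothetical names.

```python
import itertools
import math

def log_spaced(low, high, count):
    """Return `count` geometrically spaced values from low to high (inclusive)."""
    step = (math.log10(high) - math.log10(low)) / (count - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(count)]

learning_rates = log_spaced(9e-5, 0.1, 5)   # assumed spacing; only the range is stated
batch_sizes = [128, 256, 512]
dropout_rates = [0.1, 0.2, 0.3]             # small convolutional network only
L2_WEIGHT = 1e-4                            # fixed for all architectures

grid = [
    {"learning_rate": lr, "batch_size": bs, "dropout": dr, "l2": L2_WEIGHT}
    for lr, bs, dr in itertools.product(learning_rates, batch_sizes, dropout_rates)
]
print(len(grid))  # 45
print(["%.2e" % lr for lr in learning_rates])
```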

We note that hyper-parameters did not seem to make a difference on the MNIST results. For VGG, there was a larger variation, as well as a larger variance across multiple seeds.

When performing MMD conditional regularization, we vary the strength of the regularizer in $[0.0,0.1,0.2,0.5,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.]$ , with 5 replicates for each value. To minimize computational expenses, we fix the learning rate to $0.001$ , dropout rate to $0.1$ and batch size to $64$ (for downsampled datasets) or $256$ .
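To make the regularizer concrete, the sketch below computes a conditional MMD penalty between the representations of the two $Z$ groups within each label $Y$ . It is an assumption-laden illustration rather than the training code: the kernel (RBF), its bandwidth, the biased V-statistic estimator, and the function names are all choices made here, since the text does not specify them. During training, this penalty would be multiplied by the regularizer strength and added to the loss.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples a and b."""
    return (rbf_kernel(a, a, bandwidth).mean()
            + rbf_kernel(b, b, bandwidth).mean()
            - 2.0 * rbf_kernel(a, b, bandwidth).mean())

def conditional_mmd_penalty(reps, y, z, bandwidth=1.0):
    """Sum over labels of MMD^2 between representations with Z=0 and Z=1."""
    return sum(
        mmd2(reps[(y == label) & (z == 0)], reps[(y == label) & (z == 1)], bandwidth)
        for label in (0, 1)
    )

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
invariant_reps = rng.normal(size=(n, 4))          # independent of Z given Y
shifted_reps = invariant_reps + 2.0 * z[:, None]  # depends on Z

penalty_invariant = conditional_mmd_penalty(invariant_reps, y, z)
penalty_shifted = conditional_mmd_penalty(shifted_reps, y, z)
print(penalty_invariant, penalty_shifted)
```

The penalty is close to zero when the representation is independent of $Z$ given $Y$ and grows when the two groups are separated.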

D.4 Assets, code and resources

We use the BERT model bert_en_uncased_L-12_H768_A-12 from TensorFlow Hub. All other models are trained from scratch in our code infrastructure written in Python and JAX [7]. The results are then analyzed with Python and the numpy [30], matplotlib [32, https://matplotlib.org/] and pandas [49, https://pandas.pydata.org/] packages. For the small convolutional networks, training was performed with 4 GPUs (V100) and evaluation used 1 GPU per model instance. BERT used 2 Tensor Processing Units (TPUs) for training and 1 TPU for evaluation. For all other models, we used 4 Tensor Processing Units for training and 1 TPU or GPU (P100) for evaluation. We note that, apart from ViT-B and BERT, all experiments could be run on CPU.

Appendix E Results

E.1 Failure modes of data balancing with MNIST

Other confounder

We notice that the correlation between $V$ and $Z$ in $Q$ is reduced ( $\rho=-0.16$ ) compared to $P$ ( $\rho=-0.60$ ) but is not zero. In addition, we observe that the model relies on $V$ (accuracy on $Q$ : $0.769\pm 0.008$ , on $P^{0}$ : $0.647\pm 0.023$ ). As a consequence, models trained on $Q$ display a bias w.r.t. $Z$ (see equalized odds and worst group performance).

Entangled signals

During training, the model reaches $0.903\pm 0.011$ accuracy on $Q$ , but only $0.672\pm 0.004$ accuracy on $P^{0}$ . Worst-group accuracy is low and equalized odds high, displaying a failure mode of data balancing.

E.2 Celeb-A

E.2.1 Model performance

Model encoding and performance across different model sizes are displayed in Figure 9. We show that all models trained on the subsampled data display an encoding of the auxiliary factor $Z$ .

E.2.2 Distinguishing between failure modes

Correlation patterns in balanced data. We plot the Pearson correlation between $Y$ and all other available attributes (39 in CelebA) in Figure 10 (left), and similarly for $Z$ (right). We note that the correlation that increases most when balancing the data is between $Y$ and the ‘black hair’ label. As this label has a low correlation with $Z$ , this does not seem problematic. We also observe smaller changes in attributes related to hair (‘bushy-eyebrows’, ‘bald’) and accessories (‘wearing-hat’).

	$\displaystyle\operatorname{\mathbb{E}}[Z]-\frac{1}{2}$	$\displaystyle=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\frac{1}{2}\right)+\left(P(Y=0)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=0]-\frac{1}{2}\right)$
		$\displaystyle=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\operatorname{\mathbb{E}}[Z\,|\,Y=1]+\left(P(Y=0)-\frac{1}{2}\right)\operatorname{\mathbb{E}}[Z\,|\,Y=0]$
		$\displaystyle=\operatorname{\mathbb{E}}[Z\,|\,S]-\frac{1}{2}+\left(P(Y=1)-\frac{1}{2}\right)\left(\operatorname{\mathbb{E}}[Z\,|\,Y=1]-\operatorname{\mathbb{E}}[Z\,|\,Y=0]\right).$

	$\displaystyle P^{\prime}(X^{\perp}_{Z}\,|\,Y)$	$\displaystyle=\sum_{Z}P^{\prime}(X^{\perp}_{Z}\,|\,Y,Z)P^{\prime}(Z\,|\,Y)$
		$\displaystyle\stackrel{(1)}{=}\sum_{Z}Q(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P^{\prime}(Z\,|\,Y)$
		$\displaystyle=Q(X^{\perp}_{Z}\,|\,Y).$

	$\displaystyle Q(Y\,|\,R,X^{\perp}_{Z})$	$\displaystyle=\frac{\sum_{Z}Q(R,X^{\perp}_{Z},Y,Z)}{\sum_{Z,Y}Q(R,X^{\perp}_{Z},Y,Z)}$
		$\displaystyle\stackrel{(1)}{=}\frac{\sum_{Z}P(R,X^{\perp}_{Z}\,|\,Y,Z)P(Z)P(Y)}{\sum_{Z,Y}P(R,X^{\perp}_{Z}\,|\,Y,Z)P(Z)P(Y)}$
		$\displaystyle\stackrel{(2)}{=}\frac{\sum_{Z}P(R\,|\,\cancel{X^{\perp}_{Z},Y},Z)P(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P(Z)P(Y)}{\sum_{Z,Y}P(R\,|\,\cancel{X^{\perp}_{Z},Y},Z)P(X^{\perp}_{Z}\,|\,Y,\cancel{Z})P(Z)P(Y)}$
		$\displaystyle=\frac{P(R)P(X^{\perp}_{Z}\,|\,Y)P(Y)}{P(R)\sum_{Y}P(X^{\perp}_{Z}\,|\,Y)P(Y)}$
		$\displaystyle=P(Y\,|\,X^{\perp}_{Z}),$

	$\displaystyle\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,f(X)]$	$\displaystyle=\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(\phi^{\perp}_{Z}(X))]$
		$\displaystyle\stackrel{(1)}{=}\operatorname{\mathbb{E}}_{P^{\prime}}[Y\,|\,h(X^{\perp}_{Z})]$
		$\displaystyle\stackrel{(2)}{=}\operatorname{\mathbb{E}}_{P}[Y\,|\,h(X^{\perp}_{Z})]$

	$\displaystyle Q(X^{\perp}_{Z}\,|\,Z)$	$\displaystyle=\frac{\sum_{Y}Q(X^{\perp}_{Z},Y,Z)}{\sum_{Y,X^{\perp}_{Z}}Q(X^{\perp}_{Z},Y,Z)}$
		$\displaystyle\stackrel{(1)}{=}\frac{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}{\sum_{Y,X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y,Z)P(Y)P(Z)}$
		$\displaystyle\stackrel{(2)}{=}\frac{\sum_{Y}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}{\sum_{Y,X^{\perp}_{Z}}P(X^{\perp}_{Z}\,|\,Y)P(Y)P(Z)}$
		$\displaystyle=P(X^{\perp}_{Z}),$
