Mind the Graph When Balancing Data for Fairness or Robustness
Abstract
Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results show that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
1 Introduction
When training prediction models, practitioners often desire that the model’s outputs display safety properties in addition to high performance, such as being fair across demographic subgroups [29, 50] or being robust to distribution shifts [e.g. 19, 58]. These objectives can be difficult to attain if there are undesired dependencies between covariates , labels , and auxiliary factors of variation , such as confounding factors or hidden stratification [26, 27]. A commonly referenced example is that of an animal classification task from wildlife pictures [e.g. 63]: the model might identify patterns in the background of the images that are indicative of the type of animal (e.g. the presence of snow for polar bears or grass for cows), which might lead to the model failing to recognize the same animal when it is on another background. When the auxiliary factors relate to demographic attributes, the deployment of such models can have societal implications, e.g. patients not being assigned medical resources due to factors related to race [53].
Multiple mitigation strategies have been proposed to remove undesired dependencies pre-, in- or post-processing. Amongst them, balancing the training data is typically considered a straightforward approach and has been used or researched in various settings [e.g. 37, 38, 59, 8, 33, 39, 2]. This approach modifies the training distribution, indicated with , into a new, balanced distribution (which we refer to as ) that aims to approximate an ‘idealized’ training distribution in which the undesired dependencies are absent [47, 14, 76]. Models are then trained on this balanced distribution to attain different fairness or robustness criteria. A popular approach to construct a balanced distribution is by balancing classes (resp. groups), leading to a uniform distribution over (resp. ). While successful for addressing failures of robustness [e.g. 33] or of fairness due to under-representation of certain groups [e.g. 74], this approach does not induce independence between and . To approximate independence, a ‘joint’ balancing on is often performed [e.g. 47, 8]. Joint balancing can be implemented by matching the numbers of samples in all groups (only feasible when and have small, discrete domains) via subsampling the majority groups [e.g. 8], upsampling the minority groups [e.g. 62], resampling the data with weights proportional to , or reweighting the loss [9]. Our work focuses on joint balancing given its suitability to mitigate a marginal dependence between and . (Footnote 1: We briefly discuss group or class data balancing in Appendix A.1.) While the choice of the method for jointly balancing can impact the results [11, 64, 33], these methods can be seen as modifying as described in Definition 1.1.
Definition 1.1 (Jointly balanced distribution).
We say that the distribution is a jointly balanced version of if .
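As a rough illustration of joint balancing by reweighting, the sketch below computes per-sample weights from empirical counts so that, under the weights, the label and the auxiliary factor become independent. The function names, the toy data, and the choice to keep the original marginals of the label and the factor are assumptions made for this example; other target marginals (e.g. uniform) are equally possible, and every (label, factor) cell must be non-empty.

```python
from collections import Counter


def joint_balancing_weights(labels, factors):
    """Per-sample weights w_i = P(y_i) * P(z_i) / P(y_i, z_i) from empirical counts.

    Assumption: the target marginals are the original marginals of the
    label and of the factor; only their dependence is removed.
    """
    n = len(labels)
    count_y = Counter(labels)
    count_z = Counter(factors)
    count_yz = Counter(zip(labels, factors))
    return [
        (count_y[y] / n) * (count_z[z] / n) / (count_yz[(y, z)] / n)
        for y, z in zip(labels, factors)
    ]


def weighted_joint(labels, factors, weights):
    """Weighted empirical joint distribution over (label, factor) cells."""
    total = sum(weights)
    joint = Counter()
    for y, z, w in zip(labels, factors, weights):
        joint[(y, z)] += w / total
    return dict(joint)


# Toy confounded data: label and factor agree for 90% of the samples.
labels = [1] * 45 + [1] * 5 + [0] * 5 + [0] * 45
factors = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45

weights = joint_balancing_weights(labels, factors)
balanced = weighted_joint(labels, factors, weights)
```

In this toy dataset both marginals are uniform, so each of the four cells carries the same total weight after reweighting. The same weights could be used either to resample the data or to reweight the loss.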
In some cases, data balancing has proven to be an effective mitigation strategy for undesired dependencies, performing on par with other, more complex mitigation techniques [33]. Recently, data balancing has also shown promise for mitigation during fine-tuning or partial retraining [40, 43, 48, 78, 74], which is relevant to settings in which large-scale models are trained on large amounts of data. Nevertheless, data balancing has also displayed failure modes in which the obtained models were not fair, robust or optimal [75, 47, 57, 2]. These failure modes have not been thoroughly characterized and can be difficult to predict. Furthermore, the impact of data balancing on other mitigation strategies has not been studied extensively.
Given data balancing’s popularity as a baseline mitigation strategy for undesired dependencies, we aim to formalize some of its promises and pitfalls. Our analysis relies on a causal graphical framework, which allows investigating the impact of data balancing in different data generating processes. Our contributions can be summarized as follows:
- We display failure modes of data balancing in semi-synthetic tasks and highlight how predicting these failures can be challenging.
- We introduce necessary and sufficient conditions for data balancing to attain invariance to undesired dependencies as defined by fairness or robustness criteria.
- We prove that data balancing does not correspond to ‘removing’ undesired dependencies from a causal perspective, and can negatively impact fairness or robustness criteria when combined with regularization strategies.
- We illustrate how our findings can be used to distinguish between failure modes and identify next steps.
2 Preliminaries
Let , , be random variables with corresponding to a set of covariates (e.g. tabular, images or text), to an outcome to be predicted, and to an auxiliary factor of variation, such as a sensitive attribute or the type of background of an image, that displays statistical dependence with in the original, training distribution . We consider a prediction model that is trained on data from distribution to minimize the risk where is a loss function. We call optimal on if the risk attains the minimum for .
Definition 2.1 (Optimality).
A prediction model is optimal on if .
2.1 Desired criteria on a model’s predictions
While a model may be optimal on , it might not be optimal on another distribution of interest (e.g. in deployment), and/or might display disparities across subsets of the data (e.g. ) [22]. To mitigate this issue, multiple safety criteria have been defined in the fields of fairness and robustness.
Fairness: Fairness criteria can be defined in terms of the dependence between the model’s output and the auxiliary factor of variation . We consider established fairness criteria [5, 50], including demographic parity [, 23], equalized odds [, 29] and predictive parity [, 24]. Beyond fairness of , we also consider fairness of intermediate representations , e.g. [80], for their usage in downstream tasks.
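To make two of these criteria concrete, the following sketch computes a demographic parity gap and an equalized odds gap for binary predictions and a binary auxiliary factor. The gap definitions (absolute differences in positive prediction rates, with the maximum taken over the two label values for equalized odds) are one common convention and an assumption here; the exact metrics reported in the tables may be defined differently.

```python
def positive_rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, keep in zip(preds, mask) if keep]
    return sum(selected) / len(selected)


def demographic_parity_gap(preds, groups):
    """|P(pred=1 | group=0) - P(pred=1 | group=1)|."""
    rate0 = positive_rate(preds, [g == 0 for g in groups])
    rate1 = positive_rate(preds, [g == 1 for g in groups])
    return abs(rate0 - rate1)


def equalized_odds_gap(preds, labels, groups):
    """Largest gap over label values in P(pred=1 | label, group) between groups."""
    gaps = []
    for y in (0, 1):
        rate0 = positive_rate(preds, [l == y and g == 0 for l, g in zip(labels, groups)])
        rate1 = positive_rate(preds, [l == y and g == 1 for l, g in zip(labels, groups)])
        gaps.append(abs(rate0 - rate1))
    return max(gaps)


# Toy example: predictions are perfect in group 0 and at chance in group 1.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
preds = [1, 1, 1, 0, 0, 0, 1, 0]

dp_gap = demographic_parity_gap(preds, groups)
eo_gap = equalized_odds_gap(preds, labels, groups)
```

In this toy example the positive prediction rate is identical in both groups, so demographic parity holds, while the error rates differ across groups, so equalized odds does not. This illustrates why the choice of criterion is context-dependent.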
Robustness: In this field, the focus is typically on finding models parameterized by that provide the lowest risk across a family of target distributions . For instance, the ‘worst group performance’ criterion aims to select parameters such that the performance on a ‘worst’ distribution is optimized, i.e. [6, 20]. can be defined so that each distribution represents a specific subpopulation [63], to minimize the loss in each subgroup, or aiming for an invariance of across subgroups [risk-invariance 47].
Definition 2.2 (Risk-invariance).
A prediction model is risk-invariant w.r.t. the family of distributions if .
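In practice, worst group performance and risk-invariance are often inspected through per-group metrics. The sketch below computes per-group accuracy, its minimum, and the spread between the best and worst group, which is zero when the accuracy is the same in every group. Using accuracy as the (negative) risk and treating each group as one target distribution are assumptions of this example.

```python
from collections import defaultdict


def group_accuracies(preds, labels, groups):
    """Accuracy computed separately within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(preds, labels, groups):
    """Lowest accuracy across groups."""
    return min(group_accuracies(preds, labels, groups).values())


def accuracy_spread(preds, labels, groups):
    """Gap between the best and worst group; 0 means equal accuracy everywhere."""
    accs = group_accuracies(preds, labels, groups).values()
    return max(accs) - min(accs)


labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = [0, 0, 1, 1, 0, 0, 1, 1]
preds = [1, 1, 1, 0, 0, 0, 1, 0]

per_group = group_accuracies(preds, labels, groups)
worst = worst_group_accuracy(preds, labels, groups)
spread = accuracy_spread(preds, labels, groups)
```

Groups could equally be defined as (label, factor) pairs by passing tuples as group identifiers.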
If a model is optimal on and risk-invariant w.r.t. , it is also optimal w.r.t. . The choice of is context-specific and reflects some domain knowledge about shifts that are likely to arise in a given application. For instance, a plausible family of target distributions could imply a shift in the dependence between and , also known as a correlation shift [61], and be expressed as . Alternatively, we can define using a causal framework (see Section 2.2) when the data generation process is known [47].
We acknowledge that selecting amongst those criteria is context-dependent and do not advocate for a specific choice. We call a prediction model invariant to undesired dependencies, denoted with , if it satisfies one of such criteria. For brevity, we focus on risk-invariance in the main text and consider fairness criteria in Appendix. Obtaining an invariant model can be performed in different ways, with data balancing being a popular approach.
2.2 Causal framework to analyse data balancing
To understand the effects of data balancing, we need to investigate its impact on the distribution . A causal formalization is useful for studying how distributions change under different interventions. To analyse the implications of data balancing, we use the framework of causal Bayesian networks (CBNs) [e.g. 70, 13, 51, 73, 25, 47]. A Bayesian network [54, 55, 15, 41] is a pair , in which is a directed acyclic graph whose nodes represent random variables and in which is a joint distribution over the nodes. The absence of edges in implies a set of statistical independence assumptions satisfied by that can be expressed by the factorization , where denote the parents of , namely the nodes with an edge into (we say that factorizes according to ). A CBN is a Bayesian network in which an edge expresses causal influence, so that are direct causes of . A directed path between and in a CBN is also called a causal path. A non-directed path, also called non-causal path, expresses statistical dependence of non-causal nature. We refer to the statistical dependence between and that arises only due to the presence of non-causal paths as purely spurious. In our setting where are unobserved variables. Inspired by prior work [73, 3, 69, 76], we make the following assumption on the form of the covariates .
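Figure 1: Causal graphs of the tasks considered. (a) Anti-causal task, purely spurious dependence. (b) Causal task, purely spurious dependence. (c) Anti-causal task, additional factor of variation. (d) Anti-causal task, entangled data.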
Assumption 2.3 (Form of Covariates ).
In the system defined by with , decomposes as , where is a function of that does not have causal paths to/from but has causal paths to/from , is a function of that does not have causal paths to/from but has causal paths to/from , and is a function of that has causal paths to/from both and , representing entangled signals.
In the animal classification example, would correspond to the animal pixels, to the background pixels (e.g. snowy or grassy landscape), and to characteristics of the animal that depend on its environment (e.g. color of the fur pixels in bears). Intuitively, we want to build a prediction model that only depends on the animal pixels. While the decomposition may be readily available when a causal graph of the application is available and the data is tabular, we typically do not have direct access to the different functions of and these need to be isolated algorithmically.
Following Schölkopf et al. [65], we consider both the case in which are direct causes of the label (causal task) e.g. estimating the helpfulness of a text review, and the case in which is a direct cause of (anti-causal task) as in object detection tasks in computer vision. Figures 1(a-b) display examples of anti-causal and causal tasks with a purely spurious dependence between and . It is important to note that statistical relationships between the different variables and functions of are determined by the graph: for instance, in Figure 1(a) , while in Figure 1(b) .
Based on a CBN of the task and Assumption 2.3, we characterize undesired dependencies as the presence of undesired paths between and , which we indicate through red edges (Figure 1). Based on this depiction of undesired dependencies, we can define the family of target distributions such that black edges are preserved, but those in red may lead to changes in the distribution. For the anti-causal task in Figure 1(a), we can hence write in which represents any distribution but all other causal mechanisms are fixed [47], which corresponds to a correlation shift.
3 Can we predict when data balancing fails?
As reported previously, data balancing can display failure modes, e.g. due to the presence of other confounders [75, 2], finite sampling effects [47] or a dependence between and when conditioning on () [57]. However, this list is non-exhaustive and, to the best of our knowledge, there is no unifying study of those failure modes or of how they could be mitigated. In this section, we perform joint data balancing on different tasks to illustrate that successes and failures of this approach can be difficult to predict. For details of the experiments, see Appendix D.
Let’s first consider semi-synthetic examples generated from the graphs in Figure 1(a,b), i.e. an anti-causal and causal task with a purely spurious correlation. We aim to obtain a risk-invariant and optimal model on these tasks by training on the jointly balanced distribution .
Anti-causal task: number detection in MNIST. Inspired by Brown et al. [8], we modify MNIST images [44, 17] by adding a factor of variation such that the top of the image is replaced by red noise for and blue noise for (Figure 2). We sample a dataset in which the factor of variation and label are dependent (, , called the ‘confounded’ data), a jointly balanced dataset, and a dataset from a distribution in which the undesired dependency is absent (). We train convolutional networks to predict whether the number in an image is smaller or larger than 5, assessing the models on their training distribution and on .
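A minimal sketch of this kind of image modification is given below, assuming grayscale inputs with values in [0, 1]. The band height, the mapping of factor value 0 to red and 1 to blue, the uniform noise, and the function name are all hypothetical choices for illustration, not the exact generation procedure used in the experiments.

```python
import numpy as np


def add_colored_noise_band(image, factor, band_height=6, rng=None):
    """Convert a grayscale image to RGB and replace its top rows with colored noise.

    image: array of shape (H, W) with values in [0, 1].
    factor: 0 -> red noise, 1 -> blue noise (assumed mapping).
    band_height: number of top rows to overwrite (hypothetical value).
    """
    rng = np.random.default_rng() if rng is None else rng
    rgb = np.repeat(image[..., None], 3, axis=-1).astype(np.float32)
    noise = rng.uniform(0.0, 1.0, size=(band_height, image.shape[1]))
    channel = 0 if factor == 0 else 2
    rgb[:band_height] = 0.0  # clear the band in all channels
    rgb[:band_height, :, channel] = noise  # write noise in a single color channel
    return rgb


rng = np.random.default_rng(0)
digit = rng.uniform(0.0, 1.0, size=(28, 28))  # stand-in for an MNIST digit
red_version = add_colored_noise_band(digit, factor=0, rng=rng)
blue_version = add_colored_noise_band(digit, factor=1, rng=rng)
```

The dependence between label and factor in the confounded dataset would then be controlled by how the factor value is sampled for each label before calling this function.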
Models trained with confounded data (95/10) display biased outputs (Table 1), with low worst group performance and high equalized odds. Performance on is also lower compared to that on (), showing that these models are not risk-invariant w.r.t. . Models trained from balanced data obtain high overall performance and worst group accuracy, as well as low equalized odds. In addition, we were not able to decode from the model representation , suggesting that the model has not learned . (Footnote 2: This result is interesting as an addition across the channels of the raw image allows discriminating red from blue samples, and colors can easily be discriminated from a model trained to predict from scratch (accuracy=100%). We therefore show that the model is not performing any ‘incidental’ learning of .) Our results suggest that data balancing led to a fair/robust and optimal model.
Causal task: helpfulness of reviews with Amazon reviews [52]. Inspired by Veitch et al. [73], we refer to the causal task of predicting the helpfulness rating of an Amazon review (thumbs up or down, ) from its text (). We add a synthetic factor of variation such that words like ‘the’ or ‘my’ are replaced by ‘thexxxx’ and ‘myxxxx’ () or ‘theyyyy’ and ‘myyyyy’ (). We train a BERT [34] model on a class-balanced version of the data for reference (due to high class imbalance), and compare to a model trained on jointly balanced data, both evaluated on their training distribution and on a distribution with no association.
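The synthetic text perturbation can be sketched with a regular expression, as below. The word list is limited to the two examples given in the text (‘the’ and ‘my’), and the mapping of factor value 0 to the ‘xxxx’ suffix and 1 to the ‘yyyy’ suffix is an assumption; the full word list used in the experiments is not specified here.

```python
import re

# Only the two example words from the text; the real list may be longer.
_TARGET_WORDS = re.compile(r"\b(the|my)\b", flags=re.IGNORECASE)


def inject_factor(text, factor):
    """Append a factor-specific suffix to each target word.

    factor: 0 -> 'xxxx', 1 -> 'yyyy' (assumed mapping).
    """
    suffix = "xxxx" if factor == 0 else "yyyy"
    return _TARGET_WORDS.sub(lambda match: match.group(0) + suffix, text)


review = "I love the case for my phone, then again the price is high"
review_factor0 = inject_factor(review, 0)
review_factor1 = inject_factor(review, 1)
```

The word boundaries in the pattern keep words such as ‘then’ untouched, so only whole-word occurrences carry the synthetic signal.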
In this case, jointly balancing improves fairness and risk-invariance, with the model’s performance on the training distribution (acc.: ) being similar to that on (Table 1). This however comes at a high performance cost when compared to the class balanced model’s performance on (acc: ). Therefore, data balancing might not lead to optimality for this causal task.
![Refer to caption](extracted/5687372/figs/MNIST_samples.png)
Task | Dataset | Acc. () | Worst Grp () | Encoding () | Equ. Odds () |
---|---|---|---|---|---|
Anti-causal (a) | 95/10 | ||||
Anti-causal (a) | Balanced | ||||
Causal (b) | Class bal. | ||||
Causal (b) | Jointly bal. | ||||
Anti-causal (c) | With | ||||
Anti-causal (d) | Entangled |
Using the same framework, we can replicate the failure modes due to another confounder described in Wang et al. [75], Alabdulmohsin et al. [2] as well as that from Puli et al. [57].
Anti-causal task with another factor of variation . It is common for multiple auxiliary factors to influence the data generating process, and they tend to correlate with each other [e.g. 21]. To emulate this case, we introduce more unobserved variables as well as a factor of variation which affects the data through (Figure 1(c)). (Footnote 3: and its dependencies to were selected to describe an example without entangled data, but the results hold for .) We modify the MNIST data generation to include depicted by a green cross on the top left or top right of the image and jointly balance the data on before training the model. We evaluate the obtained predictor on a distribution where and are not correlated with and observe (Table 1) a large gap between worst group accuracy and overall performance, as well as non-null equalized odds. These results suggest that the model is not fair or robust, and also displays a decrease in performance compared to the model trained on data without .
Anti-causal task with entangled data. We map the work in Puli et al. [57] to our decomposition of and propose the example graph in Figure 1(d) where represents an entangled function of . To match this data generating process, the color of the noise in MNIST samples is defined by and the evaluation distribution is the disentangled with no dependence between and . Once again, the obtained model is not fair, robust or optimal (Table 1). Appendix A.2 discusses this case further.
Motivated by these examples of both success and failures, we define necessary and sufficient conditions for the success of data balancing, and highlight when the cases above fail to meet these conditions.
4 Conditions for data balancing to produce an invariant and optimal model
In this section, we introduce necessary and sufficient conditions that, taken together, lead to a risk-invariant and optimal prediction model after training on (proofs in Appendix B.1). In Appendix B.2, we derive similar conditions for fairness criteria. Throughout the rest of the paper, we use an underscore to indicate under which of or a statistical independence holds, e.g. to indicate .
We consider the criterion of risk-invariance (Definition 2.2) under correlation shift, i.e. . According to our decomposition of , the risk-minimizing function should only be a function of and not of or . To achieve this result with data balancing, we build on a prior result by Makar et al. [47], which shows that a model trained on a balanced distribution only depends on if represents a sufficient statistic for , i.e. no other part of influences .
Definition 4.1.
(Sufficient Statistic) We say that is a sufficient statistic for in if .
Definition 4.1 implies that the risk-minimizing function for does not vary with . However, this condition is not sufficient on its own to ensure that is risk-invariant w.r.t. , as or may have non-causal relationships with . To ensure optimality and risk-invariance w.r.t. , we derive the sufficient condition in Proposition 4.2.
Proposition 4.2.
If and is a sufficient statistic for in , then the risk-minimizer is risk-invariant and optimal w.r.t. .
The conditions of Proposition 4.2 concern . However, it would be of interest to express them in if it is possible to observe all covariates (e.g. in the case of tabular data). Based on our expression for , we can derive sufficient conditions on , expressed in Corollary 4.3. Let’s denote by .
Corollary 4.3.
If and , then the risk-minimizer is risk-invariant and optimal w.r.t. .
In general, we can expect that anti-causal tasks with purely spurious correlations will satisfy these conditions, as per their definition. However, this would not be the case for most causal tasks as . This result is in line with our findings in Section 3, as the MNIST data generated from the graph in Figure 1(a) validates Corollary 4.3, but the Amazon reviews data generated from Figure 1(b) does not.
It may be less obvious, but the conditions for a sufficient statistic are not met in Figures 1(c,d) as in the case of another factor of variation , and in the case of entangled data. We hence see that when a causal graph of the application is available, Corollary 4.3 can provide indicators on when data balancing might succeed or fail.
While Proposition 4.2 and its corollary provide conditions on the data generating process, prior work [e.g. 10, 31] has demonstrated that the learning strategy of also influences the model’s fairness and robustness characteristics. As data balancing on its own does not control the learning strategy, we need to define conditions on to ensure risk-invariance and optimality. To this end, we assume that the penultimate representation can be decomposed into , and such that is disentangled, i.e. . We can define the following condition for risk-invariance and optimality of where is a linear transformation of .
Proposition 4.4 (Disentangled representation).
Let be disentangled with and be a linear function. The risk-minimizer is optimal and risk-invariant w.r.t. if , is a sufficient statistic for in and .
In Proposition 4.4, we require that the representation does not ‘lose’ information about or mix it with information from . We note that such a representation can be obtained even if the data is entangled, e.g. by dropping modes of variation during training. Unlike other strategies [4, 47, 57], data balancing cannot enforce this property on its own and a disentangled representation is considered as necessary. This condition hence suggests another failure mode of data balancing when the conditions on the data are validated, but the representation is of low quality. We believe this failure mode is displayed in Kirichenko et al. [40], as the success of their data balancing mitigation only holds when using models pre-trained on large datasets.
In this section, we have identified conditions for data balancing to be successful. In the next section, we go one step further to understand how data balancing impacts the data generating process, and how it interacts with other mitigation strategies for undesired dependencies, focusing on regularization.
5 Impact of data balancing on the CBN
Joint data balancing is assumed to remove statistical dependence between and while keeping other relationships in the CBN of the task unaffected [e.g. 47, 76, 14]. This could be interpreted as ‘dropping’ edges in the undesired paths in , e.g. removing the influence of on and/or in Figure 1(a), leading to a new graph . While this interpretation is correct for joint balancing in the case of Figure 1(a), Proposition 5.1 below (proof in Appendix C) shows that it can be erroneous in general: the distribution underlying the balanced data might not factorize according to and therefore might not obey the statistical dependence relationships implied by . Therefore, balancing data to make and statistically independent, i.e. selecting samples in proportion to , is not equivalent to generating data from a distribution that factorizes according to in general. This factorization is important because downstream distributions are often assumed to follow this factorization; in fact, this assumption underlies a number of recommendations for applying regularization methodologies such as in [73].
Proposition 5.1.
Let be the CBN underlying the data, where contains an undesired path between and , and let be a modification of in which the undesired path has been removed. The distribution obtained by jointly balancing the data need not factorize according to .
Proposition 5.1 shows that statistical (in)dependencies that we assumed would remain fixed (i.e. the black edges on the graph) can be modified by the process of joint balancing. As a consequence, further interventions on (e.g. the addition of a regularizer) should not be motivated by , and we show below that combining data balancing with other mitigation strategies can lead to unexpected results.
5.1 Data balancing can hinder regularization and vice-versa
![Refer to caption](extracted/5687372/figs/Model_mnist_acc_mmd_failures.png)
When confronted with a failure mode, it is reasonable to ask whether an additional fairness or robustness regularizer might be beneficial. Based on Proposition 5.1, we see that this question might have a different answer if we are in or in . Below, we consider each failure mode and ask whether performing an additional regularization motivated by the literature would mitigate the undesired dependencies in . In Appendix C.1.2, we discuss when balancing with regularization is sufficient for different fairness criteria.
Anti-causal task. In the case of an anti-causal task with a dependence between and (Figures 1(a,c,d)), Veitch et al. [73] recommend imposing an independence between and conditioned on . If we consider both the purely spurious correlation and the entangled case, we see that regularization and data balancing would have the same effect of blocking any dependence between and . We demonstrate that in both and (see Appendix C.1), and this regularization is sensible under both distributions. This means that performing the regularization provides the sufficient conditions for a risk-invariant model, whether or not joint data balancing is performed. In theory, data balancing is not needed but is also not harmful. In the case of an added confounder, we have that depends on both and due to non-causal paths through . Therefore, imposing that might lead to results whereby the model only depends on or is trivial (e.g. predicts a constant) as the regularization encourages the removal of any dependence on , which is related to via . This behavior would be observed in both and , but data balancing on its own might be less detrimental than regularization in terms of predictive power even though it does not resolve all undesired dependencies. In this case, regularization hinders data balancing.
Based on the balanced data from Section 3, we add a conditional Maximum Mean Discrepancy [MMD, 28] to encourage during training, varying the strength of this regularizer via a hyper-parameter. In the case of the purely spurious statistical dependence between and (Figure 1(a)), there is little variation between the metrics across MMD strengths, and the model is fair and robust (Figure 3(left)). In the entangled case (Figure 3(right)), the model’s performance on and are close for medium values of the hyper-parameter (before MMD overpowers the training) and worst group performance improves markedly. This result suggests that, with the added regularizer, only varies with . Performing the same regularization in the presence of another confounder (Figure 3(middle)) leads to a plateau in performance on , but low performance on and chance-level worst group performance. In this case, we posit that the model relies exclusively on for its predictions, and the regularizer is detrimental compared to data balancing on its own (MMD=0 on the plot).
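For readers unfamiliar with MMD penalties, the sketch below evaluates a biased estimate of the squared MMD with an RBF kernel, and a ‘conditional’ variant that compares the representations of the two factor groups within each label and averages the result. The kernel, its bandwidth, the averaging scheme, and all function names are assumptions for illustration; in training, such a quantity would be added to the loss with a strength hyper-parameter, using a differentiable framework rather than NumPy.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Pairwise RBF kernel values between rows of a and rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def mmd2(a, b, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    return (
        rbf_kernel(a, a, bandwidth).mean()
        + rbf_kernel(b, b, bandwidth).mean()
        - 2.0 * rbf_kernel(a, b, bandwidth).mean()
    )


def conditional_mmd2(reps, labels, factors, bandwidth=1.0):
    """Average over labels of the squared MMD between the two factor groups."""
    total, used = 0.0, 0
    for y in np.unique(labels):
        group0 = reps[(labels == y) & (factors == 0)]
        group1 = reps[(labels == y) & (factors == 1)]
        if len(group0) and len(group1):
            total += mmd2(group0, group1, bandwidth)
            used += 1
    return total / max(used, 1)


rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
factors = rng.integers(0, 2, size=n)
reps_independent = rng.normal(size=(n, 4))  # representation ignores the factor
reps_leaky = reps_independent + 2.0 * factors[:, None]  # representation encodes it

penalty_independent = conditional_mmd2(reps_independent, labels, factors)
penalty_leaky = conditional_mmd2(reps_leaky, labels, factors)
```

The penalty is larger for the representation that encodes the factor, which is the signal the regularizer uses to discourage such encodings.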
Causal task. Finally, let us consider the causal task in Figure 1(b). In a similar case, Veitch et al. [73] suggest a regularizer such that , which would encourage the model to vary only with as . However, data balancing induces a dependence between and , as expressed below:
The RHS cannot be simplified further because , because is a collider under . Thus, the left hand side is a function of in general (see Appendix C.1 for further details and a numerical simulation). In this case, regularizing to enforce would destroy information in , whereas the same regularization under would have enabled to use all of the information in . Therefore, data balancing may hinder regularization.
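The effect of a collider under joint balancing can be checked exactly on a toy discrete system. The sketch below is an illustrative construction and not the simulation from the appendix: it assumes a binary cause `x` of the label, an unobserved binary variable `u` that also causes the label, and an auxiliary factor `z` equal to `u`, with `x` and `u` independent. The label is therefore a collider between `x` and `z`. The balanced distribution is formed by keeping the conditional of `x` given (label, factor) and replacing the joint of (label, factor) with the product of its marginals; all probabilities are hypothetical.

```python
from itertools import product


def prob_label_one(x, u):
    """Hypothetical mechanism: both x and u raise the probability of label 1."""
    return 0.1 + 0.4 * x + 0.4 * u


# Original joint P(x, y, z) with z = u, and x, u independent fair coins.
P = {}
for x, u, y in product((0, 1), repeat=3):
    p_y = prob_label_one(x, u) if y == 1 else 1.0 - prob_label_one(x, u)
    P[(x, y, u)] = 0.25 * p_y


def marginal(dist, keep):
    """Marginalize a distribution over (x, y, z) onto the kept positions."""
    out = {}
    for key, p in dist.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out


P_y, P_z, P_yz = marginal(P, (1,)), marginal(P, (2,)), marginal(P, (1, 2))

# Jointly balanced distribution: Q(x, y, z) = P(x | y, z) * P(y) * P(z).
Q = {
    (x, y, z): p / P_yz[(y, z)] * P_y[(y,)] * P_z[(z,)]
    for (x, y, z), p in P.items()
}


def prob_x_one_given_z(dist, z):
    numerator = sum(p for (x, _, zz), p in dist.items() if x == 1 and zz == z)
    denominator = sum(p for (_, _, zz), p in dist.items() if zz == z)
    return numerator / denominator


gap_original = abs(prob_x_one_given_z(P, 0) - prob_x_one_given_z(P, 1))
gap_balanced = abs(prob_x_one_given_z(Q, 0) - prob_x_one_given_z(Q, 1))
```

Under the original distribution the cause of the label is independent of the factor, whereas under the balanced distribution its conditional probability differs across factor values, even though the label and the factor are now independent. A regularizer that enforces independence from the factor would then have to discard part of the causal signal.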
We illustrate this result on the Amazon reviews dataset from Section 3 by imposing a marginal MMD regularization during training and evaluating risk-invariance across multiple . When training on , we observe that the regularization allows us to ‘flatten’ the curve, such that from medium to high values of MMD regularization, the model is risk-invariant (Figure 4(a)). On the jointly balanced data, medium values of the regularization degrade risk-invariance (see green curves on Figure 4(b)). Overall, model performance is also lower for the models trained on compared to models trained on across test sets from , at similar levels of regularization (see Figure 4(c) for MMD=16). This result shows that is not a sufficient statistic for in .
![Figure 4(a): trained on class-balanced data](extracted/5687372/figs/amazon_reviews_conf_acc_class_balanced.png)
![Figure 4(b): trained on jointly balanced data](extracted/5687372/figs/amazon_reviews_conf_acc_joint_balanced.png)
![Figure 4(c): MMD=16](extracted/5687372/figs/mmd16_amazon.jpg)
6 Case study: distinguishing between failure modes in CelebA
In this section, we show that when and are available at training time, we can try to distinguish between failure modes of data balancing by using our different observations, even in the absence of a full causal graph. We illustrate this using the benchmark task of detecting blond hair in pictures of celebrities in the CelebA [45] dataset. This label has a strong correlation with perceived gender: half of the non-males have blond hair, while only of males do. We consider a balanced, subsampled dataset (train: , test/valid: ) and the original, confounded dataset. (Footnote 4: Please note that these results were also replicated with a resampled dataset with for training.) We train a VGG [67] and four Vision Transformer [ViT, 18] architectures, with number of parameters ranging from 17 to 690 million.
We observe that, while training with balanced data leads to higher worst group accuracy and lower equalized odds scores than training with the historical data (Table 2), an important gap remains between the overall and worst group performances. These results show that data balancing leads to improvements in downstream fairness and robustness metrics, but does not provide a risk-invariant or fair model on its own. Therefore, it is likely that one of the conditions for data balancing to be sufficient is not fulfilled and understanding which condition is violated can guide our selection of another technique.
Distinguishing between failure modes. We first assume that the task is anti-causal. We then aim to understand whether there is another confounder, the data is entangled, or the representation is entangled (Proposition 4.4). As per Kirichenko et al. [40], we first attempt to improve our representation by pre-training the VGG with ImageNet [16]. While we observe an increase in performance with pre-training, there is no clear decrease in equalized odds. This result suggests that the failure may lie elsewhere. We then train models with MMD on , with the expectation that we would observe a plateau for entangled data when the model learns , or a stark decrease in worst group performance in the presence of another confounder. While there is no major pattern of correlation between and another attribute in the balanced data (see Appendix E.2.2), small effects might combine, or there might be other, unobserved attributes that influence . For a medium value of the regularization hyper-parameter, the model displays a plateau in performance and poor worst group performance. This result suggests an effect of another confounder and next steps can include methods such as Alabdulmohsin et al. [2], which controls for all (observed) auxiliary factors of variation.
Model | Acc. () | Worst Grp () | Encoding () | Equ. Odds () |
---|---|---|---|---|
Original | ||||
Balanced | ||||
Pre-trained | ||||
MMD on |
![Refer to caption](extracted/5687372/figs/Model_performance_CelebA_MMD_P.png)
7 Related works
Balanced data as mitigation for invariant models. Our results extend those of Makar et al. [47], which considered a single causal graph. Wang et al. [75] showed that balancing data did not lead to a reduction in bias amplification. The authors posit that this failure of balanced data to correct for spurious signals is due to unobserved confounding factors, which is confirmed in Alabdulmohsin et al. [2]. Rolf et al. [62] investigated upsampling by relying on a scaling law per group, focusing on the question of fairness vs performance trade-off [22]. Focusing on causal NLP settings, Joshi et al. [36] investigated causal and non-causal features, concluding that data balancing does not help in all cases. Closer to our work is that of Puli et al. [57], in which the authors showed that having does not imply that and the model can learn signals related to . Puli et al. [57] propose a method to learn a representation such that . Our work provides a framework to understand these different failure modes and proposes strategies to distinguish between them. While we focus on pre-processing mitigation with a fixed distribution , another line of work considers dynamic resampling in-processing [e.g. 35, 60, 12]. As the resampling converges towards a fixed distribution , we would expect failure modes in the presence of entangled data or of another confounder. Nevertheless, the variation in at the early stages of training might be beneficial, e.g. by disentangling the representation. We leave this investigation for future work.
Causal feature selection. Some works have used a causal framing to select features such that has robustness and/or fairness properties [e.g. 46, 70, 68, 25, 66]. Similarly, our work defines independence conditions on covariates to obtain an optimal, invariant model, and can be used to select features. Two major distinctions between feature selection works and ours reside in the fact that we consider the case in which we do not observe explicitly and that we investigate the impact of data balancing.
8 Discussion
In this work, we uncover important results to guide the use of data balancing for mitigating undesired dependencies between covariates, outcomes and auxiliary factors of variation. We first show (Section 3) that joint data balancing might not achieve the desired fairness or robustness criteria, and that the failures may seem difficult to predict. Motivated by these results, we introduce conditions under which data balancing leads to a robust or fair model (Sections 4, B.2). Importantly, we show that data balancing is not equivalent to ‘dropping an edge’ in the causal graph and can lead to distributions that do not factorize according to the desired graph (Section 5). This can have downstream consequences if further mitigation strategies are motivated by the causal graph and highlights why regularization and data balancing might not go ‘hand in hand’. This last result shows that data balancing should not be performed as a ‘default’, and mitigation strategies should be based on the causal graph of the application. Finally, even in the absence of a causal graph, our findings may help to pinpoint which condition(s) are not fulfilled, and guide further mitigation (Section 6).
Limitations. The conditions defined in Section 4 for risk-invariance depend on the expression of as a correlation shift [47, 61]. Other expressions are likely to lead to other conditions. In our experiments, we have mostly subsampled datasets to obtain balanced distributions. We would expect similar results for other joint balancing methods. Variations are, however, possible due to the finite-set nature of the computations [47], e.g. with reweighting displaying more variance [33], potentially under-performing in overparametrized settings [11, 64]. We also note that, while we aimed to provide upper bounds for the effectiveness of data balancing, we did not use additional training strategies for mitigation beyond regularization. We believe that our causal framework can be a useful tool to analyze other pre- or in-processing methods that enforce independence between variables in the data generating process [e.g. 1, 57]. On the other hand, our framework might not be suited to analyze the effects of other mitigation strategies, e.g. hyper-parameter optimization [56].
Future work. This work considered a variety of causal graphs in order to provide general insights rather than task-specific conditions. However, investigating specific graphs could make it possible to leverage further strategies, including other balancing techniques [e.g. 71]. We believe that our causal framing could then be a useful resource to analyze the effect of these strategies on downstream fairness and robustness criteria. Finally, we illustrate our propositions with binary classification tasks and confounders. While our reasoning applies to more complex settings, there might be further considerations to account for when generalizing beyond binary variables, especially with respect to estimation.
Broader impact
Our work investigates a common mitigation strategy for failures of fairness or robustness in machine learning predictive settings. We aim to clearly highlight when data balancing is promising, and when it fails, hence advancing the field of trustworthy machine learning. As with most papers addressing fairness questions, we acknowledge that our mathematical formulations of fairness criteria might not correspond to the desired societal impact, e.g. in terms of equity. Specific considerations for our work include the use of the CelebA [45] dataset, and in particular the ‘is-male’ binary label provided. We acknowledge that a binary characterization of gender is not representative and can be harmful. In addition, it would be desirable to have self-reported instead of perceived gender. Our work considers cases for which auxiliary factors of variation are observed at train, test or fine-tuning time. This is a limitation of our investigation, as our insights might not be available when is unobserved. This is exemplified by the more difficult case of distinguishing between failure modes without a in the classification of CelebA images.
Acknowledgments and Disclosure of Funding
We thank Virginia Aglietti for feedback on this work and Victor Veitch for sharing experimental code for the Amazon reviews experiments. This work was funded by Google DeepMind.
References
- Alabdulmohsin & Lučić [2021] Alabdulmohsin, I. and Lučić, M. A near-optimal algorithm for debiasing trained machine learning models. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=H5TBqNFPKSJ.
- Alabdulmohsin et al. [2024] Alabdulmohsin, I., Wang, X., Steiner, A., Goyal, P., D’Amour, A., and Zhai, X. CLIP the bias: How useful is balancing data in multimodal learning? In International Conference on Learning Representations, 2024.
- Anthis & Veitch [2023] Anthis, J. R. and Veitch, V. Causal context connects counterfactual fairness to robust prediction and group fairness. In Advances in Neural Information Processing Systems, volume 37, 2023. URL https://openreview.net/forum?id=AmwgBjXqc3.
- Arjovsky et al. [2019] Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization, 2019. Preprint 1907.02893. URL http://arxiv.org/abs/1907.02893.
- Barocas et al. [2023] Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2023.
- Ben-Tal et al. [2013] Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Manage. Sci., 59(2):341–357, 2013.
- Bradbury et al. [2018] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Brown et al. [2023] Brown, A., Tomasev, N., Freyberg, J., Liu, Y., Karthikesalingam, A., and Schrouff, J. Detecting shortcut learning for fair medical AI using shortcut testing. Nat. Commun., 14(1):4314, 2023.
- Byrd & Lipton [2019] Byrd, J. and Lipton, Z. What is the effect of importance weighting in deep learning? In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 872–881. PMLR, 2019.
- Carlini & Wagner [2017] Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.
- Celis et al. [2018] Celis, E., Keswani, V., Straszak, D., Deshpande, A., Kathuria, T., and Vishnoi, N. Fair and diverse DPP-based data summarization. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 716–725. PMLR, 2018. URL https://proceedings.mlr.press/v80/celis18a.html.
- Chen et al. [2023] Chen, X., Fan, W., Chen, J., Liu, H., Liu, Z., Zhang, Z., and Li, Q. Fairly adaptive negative sampling for recommendations. In Proceedings of the ACM Web Conference 2023, WWW ’23, pp. 3723–3733, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394161. doi: 10.1145/3543507.3583355. URL https://doi.org/10.1145/3543507.3583355.
- Chiappa [2019] Chiappa, S. Path-Specific counterfactual fairness. AAAI, 33(01):7801–7808, 2019.
- Compton et al. [2023] Compton, R., Zhang, L., Puli, A., and Ranganath, R. When more is less: Incorporating additional datasets can hurt performance by introducing spurious correlations, 2023. Preprint 2308.04431. URL http://arxiv.org/abs/2308.04431.
- Cowell et al. [2007] Cowell, R. G., Dawid, A. P., Lauritzen, S., and Spiegelhalter, D. J. Probabilistic Networks and Expert Systems, Exact Computational Methods for Bayesian Networks. Springer-Verlag, 2007.
- Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- Deng [2012] Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
- Dosovitskiy et al. [2021] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
- Drenkow et al. [2021] Drenkow, N., Sani, N., Shpitser, I., and Unberath, M. A systematic review of robustness in deep learning for computer vision: Mind the gap?, 2021. Preprint 2112.00639. URL http://arxiv.org/abs/2112.00639.
- Duchi et al. [2016] Duchi, J., Glynn, P., and Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach, 2016. Preprint 1610.03425. URL http://arxiv.org/abs/1610.03425.
- Duffy et al. [2022] Duffy, G., Clarke, S. L., Christensen, M., He, B., Yuan, N., Cheng, S., and Ouyang, D. Confounders mediate AI prediction of demographics in medical imaging. NPJ Digit Med, 5(1):188, 2022.
- Dutta et al. [2020] Dutta, S., Wei, D., Yueksel, H., Chen, P.-Y., Liu, S., and Varshney, K. Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2803–2813. PMLR, 2020. URL https://proceedings.mlr.press/v119/dutta20a.html.
- Dwork et al. [2012] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12, pp. 214–226, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450311151. doi: 10.1145/2090236.2090255. URL https://doi.org/10.1145/2090236.2090255.
- Flores et al. [2016] Flores, A. W., Bechtel, K., and Lowenkamp, C. T. False positives, false negatives, and false analyses: A rejoinder to “machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks.”. Fed. Probat., 80(2), 2016.
- Galhotra et al. [2022] Galhotra, S., Shanmugam, K., Sattigeri, P., and Varshney, K. R. Causal feature selection for algorithmic fairness. In Proceedings of the 2022 International Conference on Management of Data, SIGMOD ’22, pp. 276–285, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392495. doi: 10.1145/3514221.3517909. URL https://doi.org/10.1145/3514221.3517909.
- Geirhos et al. [2019] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bygh9j09KX.
- Gichoya et al. [2022] Gichoya, J. W., Banerjee, I., Bhimireddy, A. R., Burns, J. L., Celi, L. A., Chen, L.-C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.-C., Kuo, P.-C., Lungren, M. P., Palmer, L. J., Price, B. J., Purkayastha, S., Pyrros, A. T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., and Zhang, H. AI recognition of patient race in medical imaging: a modelling study. Lancet Digit Health, 4(6):e406–e414, 2022.
- Gretton et al. [2012] Gretton, A., Borgwardt, K. M., Rasch, M. J., and Scholkopf, B. A kernel Two-Sample test. J. Mach. Learn. Res., 13(25):723–773, 2012.
- Hardt et al. [2016] Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/9d2682367c3935defcb1f9e247a97c0d-Paper.pdf.
- Harris et al. [2020] Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
- Hooker et al. [2020] Hooker, S., Moorosi, N., Clark, G., Bengio, S., and Denton, E. Characterising bias in compressed models, 2020. Preprint 2010.03058. URL http://arxiv.org/abs/2010.03058.
- Hunter [2007] Hunter, J. D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng., 9(3):90–95, 2007.
- Idrissi et al. [2022] Idrissi, B. Y., Arjovsky, M., Pezeshki, M., and Lopez-Paz, D. Simple data balancing achieves competitive worst-group-accuracy. In Schölkopf, B., Uhler, C., and Zhang, K. (eds.), Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pp. 336–351. PMLR, 2022. URL https://proceedings.mlr.press/v177/idrissi22a.html.
- Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
- Jiang & Nachum [2020] Jiang, H. and Nachum, O. Identifying and correcting label bias in machine learning. In Chiappa, S. and Calandra, R. (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 702–712. PMLR, 2020. URL https://proceedings.mlr.press/v108/jiang20a.html.
- Joshi et al. [2022] Joshi, N., Pan, X., and He, H. Are all spurious features in natural language alike? an analysis through a causal lens. In Empirical Methods in Natural Language Processing (EMNLP), 2022.
- Kamiran & Calders [2012] Kamiran, F. and Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst., 33(1):1–33, 2012.
- Kehrenberg et al. [2020] Kehrenberg, T., Chen, Z., and Quadrianto, N. Tuning fairness by balancing target labels. Front Artif Intell, 3:33, 2020.
- Kim et al. [2023] Kim, D., Park, S., Hwang, S., and Byun, H. Fair classification by loss balancing via fairness-aware batch sampling. Neurocomputing, 518:231–241, 2023.
- Kirichenko et al. [2022] Kirichenko, P., Izmailov, P., and Wilson, A. G. Last layer re-training is sufficient for robustness to spurious correlations, 2022. Preprint 2204.02937. URL http://arxiv.org/abs/2204.02937.
- Koller & Friedman [2009] Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
- Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
- LaBonte et al. [2023] LaBonte, T., Muthukumar, V., and Kumar, A. Towards last-layer retraining for group robustness with fewer annotations, 2023. Preprint 2309.08534. URL http://arxiv.org/abs/2309.08534.
- Lecun et al. [1998] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
- Liu et al. [2015] Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015.
- Magliacane et al. [2018] Magliacane, S., van Ommen, T., Claassen, T., Bongers, S., Versteeg, P., and Mooij, J. M. Domain adaptation by using causal inference to predict invariant conditional distributions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Makar et al. [2022] Makar, M., Packer, B., Moldovan, D., Blalock, D., Halpern, Y., and D’Amour, A. Causally motivated shortcut removal using auxiliary labels. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp. 739–766. PMLR, 2022. URL https://proceedings.mlr.press/v151/makar22a.html.
- Mao et al. [2023] Mao, Y., Deng, Z., Yao, H., Ye, T., Kawaguchi, K., and Zou, J. Last-layer fairness fine-tuning is simple and effective for neural networks, 2023. Preprint 2304.03935. URL http://arxiv.org/abs/2304.03935.
- McKinney [2010] McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference. SciPy, 2010.
- Mehrabi et al. [2021] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv., 54(6):1–35, 2021.
- Mooij et al. [2020] Mooij, J. M., Magliacane, S., and Claassen, T. Joint causal inference from multiple contexts. J. Mach. Learn. Res., 21(99):1–108, 2020.
- Ni et al. [2019] Ni, J., Li, J., and McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 188–197, 2019.
- Obermeyer et al. [2019] Obermeyer, Z., Powers, B., Vogeli, C., and Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
- Pearl [1988] Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988.
- Pearl [2000] Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
- Perrone et al. [2021] Perrone, V., Donini, M., Zafar, M. B., Schmucker, R., Kenthapadi, K., and Archambeau, C. Fair bayesian optimization. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 854–863, 2021.
- Puli et al. [2022] Puli, A. M., Zhang, L. H., Oermann, E. K., and Ranganath, R. Out-of-distribution generalization in the presence of nuisance-induced spurious correlations. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=12RoR2o32T.
- Quinonero-Candela et al. [2022] Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (eds.). Dataset shift in machine learning. Neural Information Processing series. MIT Press, London, England, 2022.
- Rančić et al. [2021] Rančić, S., Radovanović, S., and Delibašić, B. Investigating oversampling techniques for fair machine learning models. In Decision Support Systems XI: Decision Support Systems, Analytics and Technologies in Response to Global Crisis Management, pp. 110–123. Springer International Publishing, 2021.
- Roh et al. [2021] Roh, Y., Lee, K., Whang, S. E., and Suh, C. Fairbatch: Batch selection for model fairness. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YNnpaAKeCfx.
- Roh et al. [2023] Roh, Y., Lee, K., Whang, S. E., and Suh, C. Improving fair training under correlation shifts. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 29179–29209. PMLR, 2023. URL https://proceedings.mlr.press/v202/roh23a.html.
- Rolf et al. [2021] Rolf, E., Worledge, T. T., Recht, B., and Jordan, M. Representation matters: Assessing the importance of subgroup allocations in training data. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 9040–9051. PMLR, 2021. URL https://proceedings.mlr.press/v139/rolf21a.html.
- Sagawa* et al. [2020] Sagawa*, S., Koh*, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxGuJrFvS.
- Sagawa et al. [2020] Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. An investigation of why overparameterization exacerbates spurious correlations. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8346–8356. PMLR, 2020. URL https://proceedings.mlr.press/v119/sagawa20a.html.
- Schölkopf et al. [2012] Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In International Conference on Machine Learning, pp. 459–466, 2012.
- Schrouff et al. [2022] Schrouff, J., Harris, N., Koyejo, S., Alabdulmohsin, I. M., Schnider, E., Opsahl-Ong, K., Brown, A., Roy, S., Mincu, D., Chen, C., Dieng, A., Liu, Y., Natarajan, V., Karthikesalingam, A., Heller, K. A., Chiappa, S., and D’Amour, A. Diagnosing failures of fairness transfer across distribution shift in real-world medical settings. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 19304–19318. Curran Associates, Inc., 2022.
- Simonyan & Zisserman [2015] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- Singh et al. [2021] Singh, H., Singh, R., Mhasawade, V., and Chunara, R. Fairness violations and mitigation under covariate shift. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 3–13. Association for Computing Machinery, New York, NY, USA, 2021.
- Sreekumar & Boddeti [2023] Sreekumar, G. and Boddeti, V. N. Spurious correlations and where to find them, 2023. Preprint 2308.11043. URL http://arxiv.org/abs/2308.11043.
- Subbaswamy & Saria [2018] Subbaswamy, A. and Saria, S. Counterfactual normalization: Proactively addressing dataset shift using causal mechanisms. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp. 947–957. Association For Uncertainty in Artificial Intelligence (AUAI), 2018.
- Sun et al. [2023] Sun, Q., Murphy, K., Ebrahimi, S., and D’Amour, A. Beyond invariance: Test-time label-shift adaptation for distributions with "spurious" correlations, 2023. Preprint 2211.15646. URL http://arxiv.org/abs/2211.15646.
- Touvron et al. [2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. Training data-efficient image transformers & distillation through attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 10347–10357. PMLR, 2021.
- Veitch et al. [2021] Veitch, V., D’Amour, A., Yadlowsky, S., and Eisenstein, J. Counterfactual invariance to spurious correlations in text classification. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=BdKxQp0iBi8.
- Wang & Russakovsky [2023] Wang, A. and Russakovsky, O. Overcoming bias in pretrained models by manipulating the finetuning dataset, 2023. Preprint 2303.06167. URL http://arxiv.org/abs/2303.06167.
- Wang et al. [2019] Wang, T., Zhao, J., Yatskar, M., Chang, K., and Ordonez, V. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5309–5318, Los Alamitos, CA, USA, 2019. IEEE Computer Society. doi: 10.1109/ICCV.2019.00541. URL https://doi.ieeecomputersociety.org/10.1109/ICCV.2019.00541.
- Wu et al. [2023] Wu, S., Yuksekgonul, M., Zhang, L., and Zou, J. Discover and cure: concept-aware mitigation of spurious correlation. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Yan et al. [2020] Yan, S., Kao, H.-T., and Ferrara, E. Fair class balancing: Enhancing model fairness without observing sensitive attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pp. 1715–1724, New York, NY, USA, 2020. Association for Computing Machinery.
- Yang et al. [2023a] Yang, Y., Nushi, B., Palangi, H., and Mirzasoleiman, B. Mitigating spurious correlations in multi-modal models during fine-tuning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 39365–39379. PMLR, 2023a. URL https://proceedings.mlr.press/v202/yang23j.html.
- Yang et al. [2023b] Yang, Y., Zhang, H., Gichoya, J. W., Katabi, D., and Ghassemi, M. The limits of fair medical imaging ai in the wild, 2023b. Preprint 2312.10083. URL http://arxiv.org/abs/2312.10083.
- Zemel et al. [2013] Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 325–333, Atlanta, Georgia, USA, 2013. PMLR. URL https://proceedings.mlr.press/v28/zemel13.html.
Appendix A Failure modes of data balancing
A.1 Failure mode: Balancing on one variable can increase bias
It is common to consider balancing on classes or groups as it requires fewer labels than joint balancing. However, without further intervention, class or group balancing on its own does not provide an invariant model when and are marginally dependent [e.g. 43]. In Figure 1(a), this means that , invalidating Prop.4.2. Below, we formalize the observation in Yan et al. [77] that balancing on one variable might affect the representation of the other, and provide bounds on the impact of this strategy.
Formalization and proof.
We formalize this issue in Proposition A.1 for the binary case with a binary attribute.
Proposition A.1.
Consider data balancing of ; the marginal of will be farther from uniform than the marginal of before balancing if
Intuitively, if the biases of and are in the same (resp. opposite) direction, then this condition is satisfied if has a negative (resp. positive) correlation with . For example, if we have , and , then before balancing but after balancing.
Proof of Proposition A.1.
We assume that and , representing the label and confounder, are both binary. We will data-balance on . Let denote the distribution of after data balancing. To characterize when the distribution of is farther from uniform than the distribution of , we will first derive
and
Now, taking the difference, we have
We can derive some sufficient conditions for bias increase, which occurs when . We proceed by cases. If , then
so . Thus, the bias gets worse if .
Similar reasoning shows that if , then
and we can conclude that the bias is worsened if . Taking both statements together, we obtain the statement of the proposition. ∎
For example, if we have , and , then but ; despite starting as unbiased, the data balancing induces a bias of .
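The effect can be checked numerically on a small joint table. The sketch below assumes a binary label Y and a binary attribute Z, and assumes that class balancing sets the marginal of Y to uniform while leaving the conditional of Z given Y unchanged (our reading of balancing on one variable). Under these assumptions, the shift in the marginal of Z is (1/2 − P(Y=1)) · (P(Z=1 | Y=1) − P(Z=1 | Y=0)). The probabilities in the table are hypothetical and are not the values of the example above.

```python
import numpy as np

def balance_on_y(p_yz):
    """Class-balance a joint table p_yz[y, z]: uniform Y, unchanged P(Z | Y)."""
    p_z_given_y = p_yz / p_yz.sum(axis=1, keepdims=True)
    return 0.5 * p_z_given_y  # Q(y, z) = 0.5 * P(z | y)

# Hypothetical joint: rows index y in {0, 1}, columns index z in {0, 1}.
# Y is imbalanced (P(Y=1) = 0.7) while Z starts out balanced (P(Z=1) = 0.5).
p_yz = np.array([[0.05, 0.25],
                 [0.45, 0.25]])
q_yz = balance_on_y(p_yz)

p_z1 = p_yz[:, 1].sum()  # P(Z=1) before balancing
q_z1 = q_yz[:, 1].sum()  # Q(Z=1) after balancing on Y

# Closed-form shift under the stated assumptions.
p_y1 = p_yz[1].sum()
p_z1_given_y = p_yz[:, 1] / p_yz.sum(axis=1)
shift = (0.5 - p_y1) * (p_z1_given_y[1] - p_z1_given_y[0])
```

In this table Z is unbiased before balancing, and balancing on Y moves its marginal to roughly 0.595, so a bias appears that was not present in the original data.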
There are a few implications of this derivation. First, we obtain an easy upper bound for the worsening of the bias of caused by data balancing: taking absolute values of both sides and using the triangle inequality on the right yields
Bringing the second term over to the left hand side and applying the same logic produces
and combining both terms shows that the difference in bias of and is bounded by
Simulation.
We present a simple simulation to illustrate our reasoning: is a common cause of and . More specifically, the continuous distributions of and both have the form , with . We then binarize by thresholding at 0. This creates an imbalance in the marginal of , such that a random sample of 5,000 examples has of positive labels. We then want to vary the marginal of , which also requires affecting their correlation. To this end, we vary the threshold for binarizing . This leads us to two main cases: for thresholds above 0 (i.e. ’s threshold), the marginal of is imbalanced in the same direction as that of . For thresholds smaller than 0, we obtain the opposite, i.e. if is over-represented, is under-represented.
We illustrate these 2 cases in Figure 6. We observe that when the marginals are similar, balancing brings closer to a uniform distribution (top row). However, the marginal distribution of becomes more imbalanced after balancing on if the two distributions are reversed (bottom row). When the correlation is small, there is little change in the marginal of when balancing on , which is expected.
Same direction
![Refer to caption](extracted/5687372/figs/balancing_y_pos.jpg)
Reverse direction
![Refer to caption](extracted/5687372/figs/balancing_y_neg.jpg)
For completeness, we perform 200 simulations with different thresholdings for and present the results in Figure 7.
![Refer to caption](extracted/5687372/figs/balancing_y_corrs.jpg)
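A simulation in the same spirit can be sketched as follows. The data-generating process here is an assumption made for illustration (a standard normal common cause, a mean shift of −0.5, Gaussian noise of scale 0.5, and two arbitrary thresholds for the attribute). It is not the exact form used for the figures above. The label is balanced by subsampling the majority class.

```python
import numpy as np

def simulate(z_threshold, n=5000, seed=0):
    """Draw binary (y, z) that share a latent common cause u (assumed DGP)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                       # common cause
    y_cont = u - 0.5 + 0.5 * rng.normal(size=n)  # assumed form, mean below 0
    z_cont = u - 0.5 + 0.5 * rng.normal(size=n)
    y = (y_cont > 0.0).astype(int)               # label thresholded at 0
    z = (z_cont > z_threshold).astype(int)       # attribute threshold varies
    return y, z

def subsample_balance_on_y(y, z, seed=0):
    """Subsample the majority class so both values of y are equally frequent."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    n_keep = min(len(idx0), len(idx1))
    keep = np.concatenate([
        rng.choice(idx0, size=n_keep, replace=False),
        rng.choice(idx1, size=n_keep, replace=False),
    ])
    return y[keep], z[keep]

def bias(v):
    """Distance of the empirical marginal of a binary variable from uniform."""
    return abs(v.mean() - 0.5)

results = {}
for name, threshold in [("same direction", 0.5), ("reverse direction", -1.5)]:
    y, z = simulate(threshold)
    y_bal, z_bal = subsample_balance_on_y(y, z)
    results[name] = (bias(z), bias(z_bal))  # (before, after) balancing on y
```

With these assumed settings, the bias of the attribute shrinks after balancing when both marginals are skewed in the same direction, and grows when they are skewed in opposite directions.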
A.2 Failure mode: entangled signals
In the case where includes non-trivial intersection information , data balancing will in general be insufficient to ensure that there is no association bias. This is because a risk-minimizing predictor will condition on , and the distribution of these intersection features is influenced by .
Specifically, we will give a case where is marginally independent of and there is no uncontrolled confounding, but .
Suppose we have the following data generating process (DGP):
Note that in this case the entirety of would be classified as intersection information .
In this setup, the Bayes-optimal probabilities for classification, , are given by:
and
Note that when we condition on , the expectation of is different whenever (1) , i.e., whenever the distribution of actually depends on the function of and , and (2) , i.e., there is some information in to predict :
In the simple case where and (i.e., deterministically), we get
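A fully discrete toy case shows the same phenomenon with exact arithmetic. The construction below is a hypothetical example and not the data generating process above: Y and Z are independent fair coins, and X = Y OR Z, so that X carries only intersection information. Conditioning on Z in addition to X changes the expectation of Y, even though Y and Z are marginally independent.

```python
import itertools
from fractions import Fraction

# Hypothetical DGP (an assumption for illustration): Y and Z are independent
# fair coins and X = Y OR Z.
joint = {}  # joint[(x, y, z)] = probability
for y, z in itertools.product((0, 1), repeat=2):
    joint[(y | z, y, z)] = Fraction(1, 4)

def prob(event):
    """Total probability of the outcomes (x, y, z) for which event is true."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def p_y1_given(x, z=None):
    """P(Y=1 | X=x) when z is None, otherwise P(Y=1 | X=x, Z=z)."""
    def cond(xx, yy, zz):
        return xx == x and (z is None or zz == z)
    return prob(lambda xx, yy, zz: cond(xx, yy, zz) and yy == 1) / prob(cond)

# Y and Z are marginally independent by construction.
p_y1 = prob(lambda x, y, z: y == 1)
p_z1 = prob(lambda x, y, z: z == 1)
p_y1_z1 = prob(lambda x, y, z: y == 1 and z == 1)

score = p_y1_given(1)          # Bayes-optimal score at X=1
score_z0 = p_y1_given(1, z=0)  # same conditional, additionally given Z=0
score_z1 = p_y1_given(1, z=1)  # same conditional, additionally given Z=1
```

Here the score at X=1 is 2/3, whereas it equals 1 when Z=0 and 1/2 when Z=1. A predictor that uses X alone therefore still depends on Z, despite the balanced joint distribution of Y and Z.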
Appendix B Conditions for data balancing to lead to an invariant and optimal model
We first investigate the case of a risk-invariant model w.r.t , and then discuss fairness criteria.
B.1 Risk-invariant, optimal model
In this section we provide proofs for Section 4.
Recall that and that we assume a data balancing distribution of the form . Also recall that we define to be a sufficient statistic for in if .
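In practice, a jointly balanced distribution is often approximated by reweighting the training examples. The sketch below is one possible implementation, assuming discrete Y and Z and a uniform target over the (y, z) cells, so that each example receives a weight inversely proportional to the empirical frequency of its cell. The function name and the synthetic data are hypothetical, and other targets (e.g. the product of the original marginals) are equally possible.

```python
import numpy as np

def joint_balancing_weights(y, z, n_y=2, n_z=2):
    """Per-example weights that make the weighted (y, z) table uniform."""
    counts = np.zeros((n_y, n_z))
    np.add.at(counts, (y, z), 1)    # empirical cell counts
    p_yz = counts / counts.sum()
    target = 1.0 / (n_y * n_z)      # uniform target: Y and Z independent
    return target / p_yz[y, z]

# Synthetic data (illustrative only): y agrees with z 80% of the time,
# which induces a marginal dependence between the two.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=10_000)
y = np.where(rng.random(10_000) < 0.8, z, 1 - z)

w = joint_balancing_weights(y, z)

# Weighted joint table of (y, z) after balancing.
weighted = np.zeros((2, 2))
np.add.at(weighted, (y, z), w)
weighted /= weighted.sum()
```

The weights change only the joint of Y and Z. The conditional of the covariates given (Y, Z) is left untouched, which is why the conditions in the propositions below concern that conditional.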
Proposition 4.2.
If and is a sufficient statistic for in , then the risk-minimizer is risk-invariant and optimal w.r.t. .
Proof.
Let be an arbitrary distribution in . We have
where (1) holds as and by the independence assumption. As we obtain . As is a sufficient statistic for in , , that is (and therefore the loss ) remains constant for different values of , giving
The same reasoning can be repeated for , obtaining
,
which proves that is risk-invariant w.r.t. .
As and , we obtain , , which implies that is optimal w.r.t. .
∎
Corollary 4.3.
Let . If and , then is risk-invariant and optimal w.r.t. .
Proof.
We have
where (1) holds by the definition of the balanced distribution and (2) holds by the independence assumptions. This derivation shows that and therefore that is a sufficient statistic for in . We are in the same conditions as in Proposition 4.2, which implies that is risk-invariant and optimal w.r.t. . ∎
Proposition 4.4.
Let be disentangled with and be a linear function. The risk-minimizer is optimal and risk-invariant across if , is a sufficient statistic for in and .
Proof.
The proof is straightforward as it directly depends on the definition of a disentangled representation and the previous statements:
Where (1) reflects the assumption of a disentangled representation, and (2) uses the proof of Proposition 4.2. ∎
B.2 Conditions for data balancing to lead to a fair model
This section gives several results to illustrate the fact that data balancing implemented to generate independence between outcomes and sensitive attributes does not necessarily imply that a function of some covariates to predict will be independent of (or not encode information on) . The results we describe do not address the case where is not accessible directly.
Proposition B.1 (Demographic parity).
if ; that is balancing successfully induces independence between and if and are independent given in the original data distribution.
Proof.
Let . The following derivation demonstrates the claim,
where holds by the definition of data balancing on the joint, holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of which establishes marginal independence. ∎
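One way to write out the two steps of this argument is sketched below. The notation is an assumption of this sketch: $f(X)$ denotes the predictor, $Y$ the outcome, $Z$ the sensitive attribute, $P$ the original distribution and $Q$ the jointly balanced distribution, with $Q(X \mid Y, Z) = P(X \mid Y, Z)$ and $Y$ independent of $Z$ under $Q$.

```latex
% Assumed notation: f(X) predictor, Y outcome, Z sensitive attribute,
% P original distribution, Q jointly balanced distribution.
\begin{align*}
Q(f(X) \mid Z)
  &= \sum_{y} Q(f(X) \mid Y = y, Z)\, Q(Y = y \mid Z) \\
  &\overset{(1)}{=} \sum_{y} P(f(X) \mid Y = y, Z)\, Q(Y = y) \\
  &\overset{(2)}{=} \sum_{y} P(f(X) \mid Y = y)\, Q(Y = y).
\end{align*}
% (1): balancing keeps P(X | Y, Z) and makes Y independent of Z under Q.
% (2): conditional independence of f(X) and Z given Y under P.
```

The final expression contains no term in $Z$, which is the marginal independence stated in the proposition.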
Proposition B.2.
In general, and are not independent in if and are not independent given in ; that is data balancing does not induce independence between and if and are not independent given in the original data distribution.
Proof.
Note first that the reduction in does not hold in general without conditional independence. Further, note that,
If and are dependent given in then and are dependent given in so that varies with , making the l.h.s a function of in general. Therefore, in general, data balancing will not be successful without conditional independence. ∎
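Propositions B.1 and B.2 can be checked numerically on a small discrete example. The sketch below is hypothetical: it uses a binary covariate as its own predictor, assumes a balanced distribution of the form Q(x, y, z) = P(x | y, z) P(y) P(z), and uses made-up probability tables. Exact fractions avoid any floating-point ambiguity.

```python
from fractions import Fraction as F
from itertools import product

BITS = (0, 1)


def balanced_joint(p_yz, p_x1_given_yz):
    """Assumed balanced distribution: Q(x, y, z) = P(x | y, z) * P(y) * P(z)."""
    p_y = {y: sum(p_yz[(y, z)] for z in BITS) for y in BITS}
    p_z = {z: sum(p_yz[(y, z)] for y in BITS) for z in BITS}
    q = {}
    for x, y, z in product(BITS, repeat=3):
        p_x = p_x1_given_yz[(y, z)] if x == 1 else 1 - p_x1_given_yz[(y, z)]
        q[(x, y, z)] = p_x * p_y[y] * p_z[z]
    return q


def prob_x1_given_z(q, z):
    """Q(X = 1 | Z = z) computed from the joint table."""
    num = sum(p for (x, _, zz), p in q.items() if x == 1 and zz == z)
    den = sum(p for (_, _, zz), p in q.items() if zz == z)
    return num / den


# Hypothetical original joint of (Y, Z) with a strong association.
p_yz = {(1, 1): F(4, 10), (1, 0): F(1, 10), (0, 1): F(1, 10), (0, 0): F(4, 10)}

# Case 1: X depends on Y only, so X is independent of Z given Y in P.
x_depends_on_y_only = {(y, z): F(2, 10) + F(7, 10) * y
                       for y, z in product(BITS, repeat=2)}
# Case 2: X depends on both Y and Z, so the conditional independence fails.
x_depends_on_y_and_z = {(y, z): F(2, 10) + F(5, 10) * y + F(2, 10) * z
                        for y, z in product(BITS, repeat=2)}

q_ok = balanced_joint(p_yz, x_depends_on_y_only)
q_fail = balanced_joint(p_yz, x_depends_on_y_and_z)
print(prob_x1_given_z(q_ok, 0), prob_x1_given_z(q_ok, 1))      # equal
print(prob_x1_given_z(q_fail, 0), prob_x1_given_z(q_fail, 1))  # differ
```

In the first case the predictor is independent of the sensitive attribute after balancing; in the second case balancing leaves a dependence in place.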
Proposition B.3 (Predictive parity).
if ; that is data balancing successfully induces independence between and given if and are independent given in the original data distribution.
Proof.
Let . The following derivation demonstrates the claim,
where holds by the definition of data balancing on the joint, holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of which establishes conditional independence. ∎
Proposition B.4.
In general, and are not independent given in if and are not independent given in ; that is data balancing does not induce independence between and given if and are not independent given in the original data distribution.
Proof.
Similarly to the arguments above, the reduction in does not hold in general without conditional independence. Therefore, in general, data balancing will not be successful without conditional independence. ∎
Proposition B.5 (Equalized odds).
if ; that is data balancing does not disturb independence between and given if and are independent given in the original data distribution.
Proof.
Let . Note that in this case we just need to show that data balancing does not disturb the conditional independence present in the original data (we already had equalized odds in original data). The following derivation demonstrates the claim,
where holds by the definition of data balancing, holds by the assumption of conditional independence. Therefore, the l.h.s is not a function of which establishes conditional independence. ∎
Proposition B.6.
In general, and are not independent given in if and are not independent given in ; that is data balancing does not induce independence between and if and are not independent given in the original data distribution.
Proof.
Similarly to the arguments above, the reduction in does not hold in general without conditional independence. Therefore, in general, data balancing will not be successful without conditional independence. ∎
Appendix C Impact of data balancing on the CBN
In the following we assume that is discrete, but all the results remain valid for continuous .
Proposition 5.1.
Let be the CBN underlying the data, where contains an undesired path between and , and let be a modification of in which the undesired path has been removed. The distribution obtained by joint balancing the data to make and statistically independent, i.e. , might not factorize according to .
Proof.
Example 1: Causal task with causal and non-causal paths. Consider , for unobserved . We have
where the r.h.s is a function of in general as is not independent of given in . If were , then in . To show the claim it suffices therefore to construct a distribution such that is not independent of given .
Example 2: Causal task with non-causal path. Consider . We have that,
The r.h.s is a function of in general as is not independent of given in a distribution consistent with . Therefore, one may not interpret the mutilated graph as a correct representation of the conditional independencies implied by the balanced distribution .
Example 3: Causal task with causal path. Consider . We have that,
The r.h.s is a function of in general as is not independent of given in . Therefore, one may not interpret the mutilated graph as a correct representation of the conditional independencies implied by the balanced distribution .
Example 4: Anti-causal task. Consider . We have that,
The r.h.s is a function of in general as is not independent of given in a distribution consistent with . Therefore, one may not interpret the mutilated graph as a correct representation of the conditional independencies implied by the balanced distribution .
∎
C.1 Regularization and data balancing don’t always go hand in hand
C.1.1 Risk-invariance
We first consider the graph in Figure 1(d) and show that in both , which justifies its use in addition to data balancing, although there might not be a benefit of using both techniques simultaneously (in theory).
Proposition C.1.
Consider the graph in Figure 1(d). Then in both the training data distribution (consistent with ) and the distribution after balancing, namely .
Proof.
holds in the training data distribution by -separation. For the conditional independence in , consider the following derivation,
The r.h.s is not a function of and therefore holds in . ∎
However, when considering the graph in Figure 1(b), we introduce a dependence between and , which can be checked with the simulation in Figure 8, in which we consider the simplified graph . While we are able to obtain the marginal independence between and (), we introduce a dependence between and ().
C.1.2 When does data-balancing together with regularization lead to fair models?
This section gives several results to analyze the combination of data balancing implemented to generate independence between outcomes and sensitive attributes and regularization in two variants. First, regularizing to learn representations such that ; and second regularizing to learn representations such that . We write to state that and are independent in distribution .
Regularization such that .
Proposition C.2 (Demographic parity).
Balancing and regularization such that and is sufficient for demographic parity, i.e. .
Proof.
where (1) holds by the assumption of balancing in which and regularization . ∎
Proposition C.3 (Predictive parity).
Balancing and regularization such that and is sufficient for predictive parity, i.e. .
Proof.
where both equalities hold by the assumption of balancing in which and regularization . ∎
Proposition C.4 (Equalized odds).
Balancing and regularization such that and is sufficient for equalized odds, i.e. .
Proof.
Regularization induces and so equalized odds is satisfied by design. ∎
Remark: Balancing and regularization are not always both necessary; for example, the section above shows that balancing on its own can be successful in some cases.
Regularization such that .
Proposition C.5 (Demographic parity).
Balancing and regularization such that and is sufficient for demographic parity, i.e. .
Proof.
Regularization induces and so demographic parity is satisfied by design. ∎
Proposition C.6 (Predictive parity).
Balancing and regularization such that and is not sufficient for predictive parity, i.e. does not hold.
Proof.
We give a counter-example. Let be three independent variables with values in . Let . Let be a probability distribution over . In particular, we could imagine to be generated after balancing and regularization since and . However, conditioned on , and determine each other and so predictive parity does not hold in . ∎
Proposition C.7 (Equalized odds).
Balancing and regularization such that and is not sufficient for equalized odds, i.e. does not hold.
Proof.
The counter-example above applies. ∎
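A counter-example of this kind can be enumerated exhaustively. The construction below is one possible instantiation and is an assumption of this sketch: three independent fair bits are combined by XOR to form the outcome, the sensitive attribute and the prediction, which makes them pairwise independent while any two determine the third.

```python
from fractions import Fraction as F
from itertools import product

BITS = (0, 1)
Y, Z, FX = 0, 1, 2  # positions of outcome, sensitive attribute, prediction

# Hypothetical construction: y = u1 ^ u2, z = u2 ^ u3, f = u1 ^ u3,
# with (u1, u2, u3) independent fair bits, each triple having probability 1/8.
outcomes = [(u1 ^ u2, u2 ^ u3, u1 ^ u3) for u1, u2, u3 in product(BITS, repeat=3)]


def prob(event, given=lambda o: True):
    """Exact conditional probability of `event` given `given`."""
    cond = [o for o in outcomes if given(o)]
    return F(sum(1 for o in cond if event(o)), len(cond))


def independent(i, j, given=lambda o: True):
    """Check whether coordinates i and j are independent under the conditioning."""
    return all(
        prob(lambda o: o[i] == a and o[j] == b, given)
        == prob(lambda o: o[i] == a, given) * prob(lambda o: o[j] == b, given)
        for a, b in product(BITS, repeat=2)
    )


balanced = independent(Y, Z)       # what balancing provides
regularized = independent(FX, Z)   # what the regularizer provides
predictive_parity = all(independent(Y, Z, given=lambda o, v=v: o[FX] == v) for v in BITS)
equalized_odds = all(independent(FX, Z, given=lambda o, v=v: o[Y] == v) for v in BITS)
print(balanced, regularized, predictive_parity, equalized_odds)
```

Both marginal independencies hold, yet neither conditional independence does, in line with Propositions C.6 and C.7.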
Appendix D Experiments
D.1 Datasets
This work uses the MNIST [44, 17, http://yann.lecun.com/exdb/mnist/], Amazon reviews [52], ImageNet [16, https://image-net.org/] and CelebA [45, http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html] datasets, which are all openly accessible and can be used for research purposes.
MNIST semi-synthetic data: For simplicity, we binarize the digit recognition task to a label according to whether the number in the image is or such that matches the ground truth with probability . The top of the image is replaced by noise coloured in red for and blue for (see Figure 2). We can relate the confounder and the label such that (resp. ) of images with have a red (resp. blue) noise pattern, while (resp. ) of the images with have a red (resp. blue) pattern, corresponding to our original distribution . In this distribution, the marginal distributions of and are (close to) uniform. We sample samples from , as well as a dataset jointly balanced on and (, ). We also sample test data based on a ground truth generated with (). Finally, we generate an dataset that contains white instead of colored noise.
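The construction of a confounded set of labels and of its jointly balanced counterpart can be sketched as follows. This is a simplified illustration rather than the pipeline used for the experiments: the agreement probability `p_agree`, the sample size and the function names are assumptions, images are omitted, and balancing is done by subsampling every (label, colour) group to the size of the smallest one.

```python
import random
from collections import Counter, defaultdict


def sample_confounded_pairs(n, p_agree, rng):
    """Sample (y, z) with uniform y and z equal to y with probability p_agree."""
    pairs = []
    for _ in range(n):
        y = rng.randint(0, 1)
        z = y if rng.random() < p_agree else 1 - y
        pairs.append((y, z))
    return pairs


def subsample_jointly_balanced(pairs, rng):
    """Return indices such that every (y, z) group has the same number of samples."""
    groups = defaultdict(list)
    for idx, pair in enumerate(pairs):
        groups[pair].append(idx)
    smallest = min(len(members) for members in groups.values())
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, smallest))
    return sorted(kept)


rng = random.Random(0)
pairs = sample_confounded_pairs(10_000, p_agree=0.9, rng=rng)  # assumed values
kept = subsample_jointly_balanced(pairs, rng)
group_counts = Counter(pairs[i] for i in kept)
print(Counter(pairs))   # imbalanced groups in the confounded data
print(group_counts)     # equal group sizes after joint balancing
```

Subsampling discards most of the majority-group samples, which is why the balanced dataset is smaller than the confounded one.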
MNIST semi-synthetic data with added confounder: We add and to our data generating process where is a green cross either on the left or right of the image, with a fixed vertical position. The horizontal position of the cross is given by and is correlated with (, ). We generate a confounded dataset (95/10) as previously, which we balance jointly on and . We then train 5 replicates of the same architecture, and test our model on , as well as on the ground truth where .
MNIST semi-synthetic data, entangled: We define the color of the noise based on an . We define by generating samples with , while is represented by the disentangled test dataset described above.
Amazon reviews with confounder: We refer to Veitch et al. [73] and define a causal task based on Amazon reviews for the clothing category which predicts whether the review was found to be helpful (i.e. obtained ‘thumbs up’ votes) or not based on the review’s text. We generate a random variable as the unobserved confounder, and define as the binary helpfulness label, randomly flipping the label based on (association: p=0.4). This leads to reviews with being more associated with . We define as , where is another random variable distributed uniformly and is a parameter that controls the relationship between and , and by transitivity, between and . In , is selected to be 0.8, leading to a correlation of 0.35 between and . To create , we add perturbations to the text based on the value of that wouldn’t (in theory) affect . We select the words {and, the, you, my, they} and add a suffix ‘xxxx’ (resp. ‘yyyy’) when (resp. ). Finally, is imbalanced, with only of the dataset with . We hence re-balance the classes before modelling. This operation is also performed by the joint balancing.
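The text perturbation can be illustrated with a short sketch. The mapping from the value of the auxiliary factor to the suffix (0 to ‘xxxx’, 1 to ‘yyyy’) and the function name are assumptions made for this example; only the word list and the two suffixes come from the description above.

```python
import re

TARGET_WORDS = ("and", "the", "you", "my", "they")
# Match any target word as a whole word, regardless of case.
_PATTERN = re.compile(r"\b(" + "|".join(TARGET_WORDS) + r")\b", flags=re.IGNORECASE)


def perturb_review(text, z):
    """Append a suffix to each target word; the suffix depends on z (assumed mapping)."""
    suffix = "xxxx" if z == 0 else "yyyy"
    return _PATTERN.sub(lambda match: match.group(0) + suffix, text)


print(perturb_review("I love the fit and my size was right", z=0))
# I love thexxxx fit andxxxx myxxxx size was right
```

Since the suffix is attached to frequent function words, it is present in nearly every review and carries no information about helpfulness beyond its association with the auxiliary factor.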
D.2 Metric definitions and operationalization
Our work focuses on statistical group fairness criteria [5]. These can be translated as independence criteria on the model’s predictions.
Definition D.1 (Demographic parity).
A predictor is said to satisfy demographic parity w.r.t. sensitive attribute and distribution if .
Definition D.2 (Predictive parity).
A predictor trained to predict an outcome is said to satisfy predictive parity w.r.t. sensitive attribute and distribution if .
Definition D.3 (Equalized odds).
A predictor trained to predict an outcome is said to satisfy equalized odds w.r.t. a sensitive attribute and distribution if .
In our experiments, we estimate equalized odds as in Alabdulmohsin & Lučić [1]. For this metric, the lower, the better.
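For binary predictions and a binary sensitive attribute, simple gap-based estimates of these criteria can be computed as in the sketch below. This is not the estimator of Alabdulmohsin & Lučić [1] used in the experiments; it is a plain difference of group-wise rates, shown only to make the definitions concrete, and the toy arrays are made up.

```python
def rate(preds, mask):
    """Mean prediction over the samples selected by a boolean mask."""
    selected = [p for p, keep in zip(preds, mask) if keep]
    return sum(selected) / len(selected)


def demographic_parity_gap(preds, zs):
    """|P(f = 1 | Z = 1) - P(f = 1 | Z = 0)|; zero under demographic parity."""
    return abs(rate(preds, [z == 1 for z in zs]) - rate(preds, [z == 0 for z in zs]))


def equalized_odds_gap(preds, ys, zs):
    """Largest gap in P(f = 1 | Y = y, Z = z) across z, over both values of y."""
    gaps = []
    for y in (0, 1):
        rate_z1 = rate(preds, [yy == y and z == 1 for yy, z in zip(ys, zs)])
        rate_z0 = rate(preds, [yy == y and z == 0 for yy, z in zip(ys, zs)])
        gaps.append(abs(rate_z1 - rate_z0))
    return max(gaps)


# Toy example (hypothetical values).
ys = [1, 1, 1, 1, 0, 0, 0, 0]
zs = [1, 1, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 1, 0, 0, 0]
print(demographic_parity_gap(preds, zs))   # 0.5
print(equalized_odds_gap(preds, ys, zs))   # 0.5
```

As with the metric used in the experiments, lower values indicate a smaller violation.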
In terms of robustness metrics, we evaluate a simplified version of risk-invariance by computing model performance on a test set sampled from , and contrasting this result with the model’s performance on a test set sampled from (when known), or from . We also estimate worst-group performance [63] as:
An invariant model that is optimal would hence display high performance on both and /, as well as high worst-group accuracy.
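A common way to compute worst-group performance, assumed here to match the usage in [63], is to take the minimum accuracy over the groups defined by each combination of label and auxiliary factor. The sketch below uses made-up predictions to show how a reasonable average accuracy can hide a failing group.

```python
from collections import defaultdict


def worst_group_accuracy(preds, ys, zs):
    """Minimum accuracy over the groups defined by each (y, z) combination."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, y, z in zip(preds, ys, zs):
        total[(y, z)] += 1
        correct[(y, z)] += int(pred == y)
    return min(correct[group] / total[group] for group in total)


# Toy example (hypothetical): the model always predicts 1 when z = 1.
ys = [1, 1, 1, 1, 0, 0, 0, 0]
zs = [1, 1, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 1, 1, 0, 0]

average_accuracy = sum(int(p == y) for p, y in zip(preds, ys)) / len(ys)
print(average_accuracy)                     # 0.625
print(worst_group_accuracy(preds, ys, zs))  # 0.0
```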
Metrics like risk-invariance or equalized odds provide insights on the model’s outputs, but do not probe the model’s representation. As we are interested in large-scale models that might be further fine-tuned, it is important to understand whether the model’s representation is invariant on . Defining a representation as , we can write in which we assume the representation to be fixed (i.e. frozen model weights) and is a learnable function. In Zemel et al. [80], the authors define a fair representation w.r.t. a binary as demographic parity on the representation:
where corresponds to the samples with . This is equivalent to assessing the ‘encoding’ of in , by training a linear layer [27, 8]. Chance level performance of would then suggest that the representation is independent of . In the present work, we estimate the encoding of using or such that assessing the encoding of is equivalent to assessing the encoding of . Models that encode less of the auxiliary factor have been shown to reach a more ‘global’ optimum compared to models that encode the signal more strongly [independently of whether invariant predictions are obtained 79].
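The probing procedure can be sketched with a logistic-regression probe trained on frozen representations. Everything in this example is an assumption made for illustration: the two-dimensional synthetic representations, the optimizer settings and the function names. It is not the probe implementation used in the experiments.

```python
import math
import random


def train_linear_probe(reps, zs, lr=0.5, epochs=200):
    """Fit logistic regression z ~ sigmoid(w . r + b) by full-batch gradient descent."""
    dim, n = len(reps[0]), len(reps)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for r, z in zip(reps, zs):
            logit = sum(wi * ri for wi, ri in zip(w, r)) + b
            err = 1.0 / (1.0 + math.exp(-logit)) - z
            grad_w = [g + err * ri for g, ri in zip(grad_w, r)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, reps, zs):
    """Accuracy of the thresholded linear probe."""
    hits = 0
    for r, z in zip(reps, zs):
        pred = 1 if sum(wi * ri for wi, ri in zip(w, r)) + b > 0 else 0
        hits += int(pred == z)
    return hits / len(zs)


rng = random.Random(0)
zs = [rng.randint(0, 1) for _ in range(1000)]
# Hypothetical frozen representations: one leaks z in its first coordinate,
# the other is pure noise and therefore carries no information about z.
encoding = [[(2 * z - 1) + rng.gauss(0, 0.3), rng.gauss(0, 1)] for z in zs]
invariant = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in zs]


def held_out_probe_accuracy(reps):
    """Train the probe on the first half and evaluate it on the second half."""
    w, b = train_linear_probe(reps[:500], zs[:500])
    return probe_accuracy(w, b, reps[500:], zs[500:])


acc_encoding = held_out_probe_accuracy(encoding)
acc_invariant = held_out_probe_accuracy(invariant)
print(acc_encoding, acc_invariant)
```

A probe accuracy well above chance indicates that the auxiliary factor is linearly decodable from the representation, whereas accuracy close to chance is consistent with an invariant representation (up to the limited capacity of a linear probe).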
D.3 Model architectures
We consider multiple architectures in this work, with an attempt to cover different model sizes and characteristics.
• Small convolutional network, similar in spirit to AlexNet [42]. It includes 5 convolution blocks with kernel sizes (4, 3, 2, 2, 2, 2) and output channels (3, 6, 9, 12, 12, 9), with max pooling after each convolution, as well as two dense layers with ReLU non-linearity before the output head.
• VGG network [67] with square kernels of size 3, output channels of dimensions (64, 64, 128, 128, 128, 256, 256, 256, 512, 512, 512) and strides (1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1).
• For text data, we use the BERT architecture, as defined in TensorFlow Hub.
We use a stochastic gradient descent optimizer with Nesterov momentum of for all models.
D.3.1 Hyper-parameter searches
We include a hyper-parameter search over the learning rate (5 values in log-scale between and ) coupled with a batch size search over sizes of 128, 256 and 512 examples. In terms of regularization, the small convolutional network includes dropout in the dense layers (search on 0.1, 0.2, 0.3), while VGG includes batch normalization in the dense layers (as per their original implementations). We impose an L2-regularization of during training for all architectures.
We note that hyper-parameters did not seem to make a difference on the MNIST results. For VGG, there was a larger variation, as well as a larger variance across multiple seeds.
When performing MMD conditional regularization, we vary the strength of the regularizer in , with 5 replicates for each value. To minimize computational expenses, we fix the learning rate to , dropout rate to and batch size to (for downsampled datasets) or .
D.4 Assets, code and resources
We use the BERT model bert_en_uncased_L-12_H768_A-12 from TensorFlow Hub. All other models are trained from scratch in our code infrastructure written in Python and JAX [7]. The results are then analyzed with Python and the numpy [30], matplotlib [32, https://matplotlib.org/] and pandas [49, https://pandas.pydata.org/] packages. For the small convolutional networks, training was performed with 4 GPUs (V100) and evaluation used 1 GPU per model instance. BERT used 2 Tensor Processing Units (TPUs) for training and 1 TPU for evaluation. For all other models, we used 4 Tensor Processing Units for training and 1 TPU or GPU (P100) for evaluation. We note that, apart from ViT-B and BERT, all experiments could be run on CPU.
Appendix E Results
E.1 Failure modes of data balancing with MNIST
Other confounder
We notice that correlation between and in is decreased () compared to () but is not null. In addition, we observe that the model relies on (accuracy on : , on : ). As a consequence, models trained on display a bias w.r.t. (see equalized odds and worst group performance).
Entangled signals
During training, the model reaches accuracy on , but only accuracy on . Worst-group accuracy is low and equalized odds high, displaying a failure mode of data balancing.
E.2 Celeb-A
E.2.1 Model performance
Model encoding and performance across different model sizes are displayed in Figure 9. We show that all models trained on the subsampled data display an encoding of the auxiliary factor .
![Refer to caption](extracted/5687372/figs/Model_encoding_size_celebA.png)
![Refer to caption](extracted/5687372/figs/Model_performance_size_celebA.png)
E.2.2 Distinguishing between failure modes
Correlation patterns in balanced data. We plot the Pearson correlation between and all other available attributes (39 in CelebA) in Figure 10 (left), and similarly for (right). We note that the correlation that increases most when balancing the data is between and the ‘black hair’ label. As this label has a low correlation with , this does not seem problematic. We also observe smaller changes in attributes related to hair (‘bushy-eyebrows’, ‘bald’) and accessories (‘wearing-hat’).
![Refer to caption](extracted/5687372/figs/confounded_CelebA_balanced_Y.jpg)
![Refer to caption](extracted/5687372/figs/confounded_CelebA_balanced_Z.jpg)