Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement

Pushkar Shukla *
TTI-Chicago
[email protected]
Dhruv Srikanth *
Carnegie Mellon University
[email protected]
Lee Cohen
Stanford University
[email protected]
Matthew Turk
TTI-Chicago
[email protected]
Abstract

We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is, images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables.

1 Introduction

Computer vision systems trained on large amounts of data have been at the heart of recent technological innovation and growth. Such systems continue to find increasing use-cases in different areas such as healthcare, security, autonomous driving, remote sensing and education. However, in spite of the tremendous promise of large vision models, several studies have demonstrated the presence of unwanted societal biases in these models [6, 7, 18]. Therefore, understanding and mitigating these biases is crucial for deploying such applications in real-world scenarios.

Several approaches have been proposed to mitigate [7, 47, 18, 38, 55, 32, 23, 60, 56], measure [12, 3], and explain [2, 13, 61] biases in computer vision models. Among these approaches, the use of counterfactuals has emerged as a promising and prominent line of research [12, 3, 2, 13, 61]. A counterfactual for an image, within the context of a model, is a modified version of the original image where certain regions are systematically replaced. This alteration is used to assess the output of the system when presented with the modified image, while ensuring that the modified image remains similar to the original in terms of its key features and characteristics [54].

Figure 1: An example of gender counterfactuals produced by StyleGAN2 versus our method, for a smile classification task. StyleGAN2’s images (right) correlate femininity with darker lipstick and exaggerated smiling, introducing biases. Our approach (left) generates ASACs that retain the same visual appearance as the original image.

However, current counterfactual generation algorithms used for bias mitigation have several limitations. Firstly, most of these algorithms rely on generative models to produce image counterfactuals with respect to a sensitive attribute (e.g., race, gender). Counterfactuals generated by these image generation mechanisms may inherently contain spurious correlations due to the presence of bias in the image generation model.

One such example of biased counterfactuals produced by an image generation model, StyleGAN2[25], is illustrated for smile classification in Figure 1, where female counterfactuals generated by the model exhibit exaggerated smiles and dark makeup. Using such counterfactuals to address gender biases in smile prediction may exacerbate the issue they are meant to solve. Additionally, generating counterfactuals based on sensitive attributes like race or gender raises ethical concerns regarding the stereotypes associated with images generated along dimensions of a sensitive attribute. For instance, attributing specific appearances (heavy makeup and increased smile) to female faces may oversimplify their diversity. Therefore, to ensure fairness in computer vision models, we must carefully generate and integrate counterfactuals without introducing new or exacerbating already present biases.

We propose using adversarial images as an alternative to generative models for creating image counterfactuals to assess and mitigate biases in computer vision models. Traditionally designed to deceive models, adversarial images can improve fairness metrics and accuracy. We hypothesize that these specifically designed adversarial images will reduce spurious biases and enhance model robustness and performance. Our method uses existing adversarial techniques to create counterfactuals that challenge vision models based on sensitive attributes. We refer to these examples as “Attribute-Specific Adversarial Counterfactuals” or ASACs. Our method ensures that ASACs retain the visual appearance of the original image, effectively addressing ethical concerns regarding the quality of image counterfactuals. By keeping parts of the image unaltered, the likelihood of introducing spurious correlations (propagated by the generative model during the image generation process) into the fairness mitigation pipeline is markedly reduced.

In addition, our work introduces a novel curriculum learning based fine-tuning approach that utilizes these counterfactuals to mitigate biases in vision models. Our approach utilizes ASACs generated at different noise magnitudes and assigns a learning curriculum to these examples based on their ability to deceive the model. We then fine-tune a biased model using the curriculum of ASACs generated. We validate our approach both qualitatively and quantitatively using experiments on various classifiers trained on the CelebA [33] and LFW [20] datasets. Our findings indicate that the proposed training approach not only improves fairness metrics but also maintains or enhances the overall performance of the model. In addition, we show that our method generalizes to biased models of varying scales (6M-24M). Our contributions can be summarized as follows: we introduce a bias-averse method for generating image counterfactuals using adversarial images to create Attribute-Specific Adversarial Counterfactuals (ASACs) for sensitive attributes. We also present a novel curriculum learning-based fine-tuning approach that leverages ASACs to mitigate pre-existing biases in vision models. Additionally, we provide a comprehensive evaluation across various datasets, architectures and metrics, demonstrating that our method enhances fairness metrics without compromising the model’s overall performance.

2 Related Works

Figure 2: Bias Mitigation Strategy: Our proposed solution for mitigating biases in a model (e.g., smile classifier) $M(\theta,\rho)$ involves training a sensitive attribute classifier $C(\theta,\phi)$ (shown in the network architecture). We then follow a three-stage pipeline. (1) We generate ASACs that are capable of deceiving $C(\theta,\phi)$. (2) We define a curriculum assignment strategy that organizes these ASACs based on the degree to which they deceive the original model $M(\theta,\rho)$. (3) We fine-tune the original model $M(\theta,\rho)$ using the organized ASACs and the proposed loss function (see Equation 5).
Adversarial examples

Adversarial examples have emerged as a notable challenge in the realm of computer vision[37, 58, 51]. These examples are crafted with the specific goal of deceiving machine learning models by making subtle alterations to input data, resulting in model misclassification. Adversarial attacks are specialized algorithms designed to generate such examples and have been the subject of extensive research. These attacks are categorized into black-box [44, 21, 40] and white-box attacks [51, 37], depending on the attacker’s access to model parameters. Our approach uses two white-box methods, FGSM [14] and PGD[35].

Bias mitigation in computer vision

Bias mitigation strategies in computer vision models can be categorized into three main approaches depending on when they are deployed into the machine learning pipeline: pre-processing, in-processing, and post-processing approaches. Pre-processing methods [43, 45, 63] focus on addressing biases before training the models. This is achieved by either adjusting the data distribution or strategically augmenting existing data. In-processing techniques[46, 48, 62] try to improve the model during the training phase either by proposing an alternate training strategy or by changing the model structure. Post-processing strategies [59, 34] adjust fairness criteria after the model is trained. Our approach falls within the post-processing category as it is a method to fine-tune an already existing model concerning a protected attribute.

Counterfactuals and their role in machine learning

The notion of using counterfactuals in machine learning models is closely related to the definition of counterfactual fairness proposed by Kusner et al. [30]. Counterfactuals have emerged as a versatile tool in machine learning [26, 39, 49], natural language processing [8, 27], and computer vision [2, 13, 61, 1, 12, 3, 11, 9, 60, 63]. Counterfactual images have been used to explain [2, 13, 61], analyze [1, 12, 3, 10] and improve fairness [11, 9] in computer vision models.

The Relationship Between Counterfactual and Adversarial Examples

While the connection between counterfactuals and adversarial images has been extensively explored in theoretical machine learning [41, 52, 53, 24, 5], limited attention has been given to this relationship in computer vision [63, 42, 31, 57]. Among these approaches, the works of Wang et al. [60] and Zhang et al. [63] are particularly noteworthy because of their proximity to our approach. Wang et al. proposed a post-processing approach to mitigate biases using adversarial examples, uniquely incorporating a GAN-based loss function. Conversely, Zhang et al. employ an architecture similar to ours, generating adversarial perturbations to counteract biases. Our research, however, is distinct in proposing a novel training setup that introduces a curriculum learning strategy along with a loss objective for fine-grained control. Unlike previous methods that often rely on a single adversarial example, our approach systematically assesses the impact of each adversarial example on the original classifier, providing a more directed training regime.

Curriculum Learning

Curriculum learning is an effective training paradigm that gradually exposes models to progressively more complex examples, aiding in better generalization [4, 22, 36, 29, 15]. This approach strategically organizes training data, facilitating smoother convergence and improved performance. Several approaches have since been proposed to organize data, including sorting the data [4], adaptive organization [36], self-paced curriculum assignment [36], and teacher-based curriculum learning [36].

3 Method

Notations

We define the data distribution as $D$, from which image-label pairs $(x,y)$ are sampled i.i.d., both for train and test data. Protected attribute pairs are denoted as $(x_a,y_a)$ and are sampled from the same data distribution as $(x,y)$, with adversarial perturbations relative to an attribute-specific image $x_a$ represented by $\delta_x$. As an example, consider a smile classifier with gender as the protected attribute. Here $(x,y)$ is the input and true label of the smile classifier, whereas $x_a$ denotes the image obtained when the input image $x$ is perturbed. We define the target attribute classifier as $M(\theta,\rho)$, using a backbone network $\theta$ and a linear probe $\rho$, and the protected attribute classifier as $C(\theta,\phi)$, sharing the backbone $\theta$ but with a different linear probe $\phi$. Predictions from these classifiers are $\hat{Y}=M(\theta,\rho,x)$ for the target and $\bar{Y}=C(\theta,\phi,x_a)$ for the protected attribute.

Our method for creating attribute-specific adversarial counterfactuals (ASACs) and employing them to fine-tune a biased model can be broken down into three stages. First, we generate ASACs for a set of input images $X=\{x^{1},x^{2},\ldots,x^{n}\}$, aiming to mislead the classification model $M(\theta,\rho)$ regarding a protected attribute $a$ (detailed in Section 3.1). For example, when training a smile classifier, we produce adversarial images that prompt misclassification based on the protected attribute of gender. The next step (Section 3.2) evaluates the effectiveness of the generated ASACs. Continuing with our example, we measure the capacity of the ASACs to incorrectly influence the model’s smile classification, despite the ASACs being generated to confuse the model about gender. We assign a training curriculum based on the ASAC’s success in deceiving the model. This procedure can be seen in Algorithm 1. Finally, the concluding stage utilizes the curriculum and ASACs produced, and introduces an adversarial loss (Section 3.3) to fine-tune the original model $M(\theta,\rho)$. This aims to reduce biases within the model and enhance its discriminative performance. The training setup and hyperparameters are detailed in Section B.4.

3.1 Generating attribute-specific adversarial examples

This section details our approach to constructing ASACs, adversarial images designed to mislead a model concerning a specific sensitive attribute, denoted as $a$. We start by employing a target classifier $M(\theta,\rho)$ and develop an attribute classifier, $C(\theta,\phi)$, to predict the protected attribute $a$ within the dataset $D$. Both models share a similar structure, with only the last layer differing, as depicted in Figure 2, thereby utilizing nearly identical representations for classification. They are trained on identical data distributions $D$. To construct each ASAC $x_a$, we augment the original image $x$ with noise $\delta_{x_a}$ using common methods to generate adversarial images as shown below.

$$x_a = x + \delta_{x_a} \tag{1}$$
Algorithm 1 Construct Minibatch for Training based on Curriculum
function ConstructCurriculum($M(\theta,\rho)$, $\{x_a(1),\dots,x_a(k)\}$, $\{\epsilon_1,\dots,\epsilon_l\}$, $f_{attack}$, order)
    Inputs:
        $M(\theta,\rho)$: Target attribute classifier
        $\{x_a(1),\dots,x_a(k)\}$: Attribute-specific minibatch
        $\{\epsilon_1,\dots,\epsilon_l\}$: Curriculum of noise magnitudes
        $f_{attack}$: Any scalable adversarial attack
        order: Order to sort ASACs in
    minibatch ← {}
    for $i \leftarrow 1$ to $k$ do
        for $j \leftarrow 1$ to $l$ do
            $x'_a \leftarrow f_{attack}(x_a(i), \epsilon_j)$    ▷ Construct ASAC with Eq. 1
            score ← $DS(x'_a; M(\theta,\rho))$    ▷ Compute difficulty score with Eq. 4
            minibatch.append(($x'_a$, score))
        end for
    end for
    minibatch.sort(key = lambda x: x[1])    ▷ Sort minibatch by difficulty score, following order
    return minibatch
end function

In our work, we experiment with two commonly used adversarial methods, Fast Gradient Sign Method (FGSM) [14] and Projected Gradient Descent (PGD) [35], to generate images capable of deceiving the attribute classifier. FGSM is a single-step attack that perturbs input images along the sign of the gradient of the loss function, as shown in Equation 2, with $\epsilon\in[0,1]$ denoting a scaling factor which we refer to as the noise magnitude. PGD applies FGSM-like steps iteratively to maximize the loss function, projecting the perturbed input back into a bounded neighborhood of the original image after each step.

$$\delta_x = \epsilon \times \text{sign}\left(\nabla_x J(\theta,\phi,x_a,y)\right) \tag{2}$$
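To make Equations 1 and 2 concrete, the following is a minimal sketch of FGSM and PGD against a toy attribute classifier. It is not the authors' implementation: the logistic-regression classifier, the helper names (`loss_and_grad`, `fgsm`, `pgd`), the clipping of pixels to $[0,1]$, and the PGD step size are all assumptions made for illustration. In practice the gradient would come from the attribute classifier $C(\theta,\phi)$ via automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(w, b, x, y):
    """Binary cross-entropy J of a toy logistic classifier (a stand-in for
    C(theta, phi)) and its analytic gradient with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    tiny = 1e-12  # numerical guard for log(0)
    loss = -(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny))
    grad_x = [(p - y) * wi for wi in w]  # dJ/dx for logistic regression
    return loss, grad_x


def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def fgsm(w, b, x, y, epsilon):
    """Single-step attack: x_a = x + epsilon * sign(grad_x J) (Eqs. 1 and 2).
    Clipping to [0, 1] is an assumption about the pixel range."""
    _, g = loss_and_grad(w, b, x, y)
    return [clip(xi + epsilon * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, epsilon, step, n_steps):
    """Iterated signed-gradient steps, each followed by a projection onto the
    L-infinity ball of radius epsilon around the original input x."""
    x_adv = list(x)
    for _ in range(n_steps):
        _, g = loss_and_grad(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [clip(clip(xa, xo - epsilon, xo + epsilon), 0.0, 1.0)
                 for xa, xo in zip(x_adv, x)]
    return x_adv
```

With $\epsilon=0$ both functions return the input unchanged, which matches the inclusion of the clean minibatch in the curriculum.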

3.2 Curriculum Learning

We propose a curriculum learning based strategy for training our model. Our approach comprises two main stages. First, we evaluate the influence of each ASAC on the classification model $M(\theta,\rho)$ by assessing its difficulty score. Subsequently, the model implements a curriculum for training, guided by the difficulty scores of all examples.

3.2.1 Computing Difficulty Scores

The difficulty score is crucial in curriculum learning, as it measures the utility of each example and determines the order in which ASACs are presented to the model during adversarial training [29]. However, obtaining the optimal difficulty score a priori is not feasible in our setting. To address this, we define the difficulty score for any attribute-specific image $x_a$ as the complement of the softmax value that the model $M$ returns for $x_a$.

$$DS(x_a) = 1 - \text{Softmax}(x_a; M_{\theta,\rho}) = 1 - \frac{\exp(M_{\theta,\rho}(x_a))}{\sum_{i=1}^{N}\exp(M_{\theta,\rho}(x_a(i)))} \tag{3}$$

where $N$ is the number of classes for the target outcome. Since we use ASACs for adversarial curriculum learning, we rewrite Equation (3) as shown below:

$$DS(x'_a) = 1 - \text{Softmax}(x'_a; M_{\theta,\rho}) \tag{4}$$

A subtle but key point to note here is that the model used to determine the difficulty score, and ultimately the curriculum, is the target classifier $M(\theta,\rho)$; however, the examples being scored are the generated ASACs, i.e., $x'_a$, which are generated as a function of the protected attribute classifier $C(\theta,\phi)$.
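The difficulty score of Equation 4 can be sketched in a few lines. This is an illustrative reading rather than the authors' code: it assumes that the "softmax value" is the probability the target classifier assigns to the true target label, and it takes the classifier's logits for an ASAC as a plain list. The names `softmax` and `difficulty_score` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the N target classes."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def difficulty_score(logits, true_class):
    """DS(x'_a) = 1 - Softmax(x'_a; M): one minus the probability that the
    target classifier assigns to the true label (assumed interpretation)."""
    return 1.0 - softmax(logits)[true_class]
```

Under this reading, an ASAC that the target classifier still classifies confidently and correctly scores close to 0, while one that flips the prediction scores close to 1.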

3.2.2 Curriculum Assignment

For each minibatch of $k$ attribute-specific images $\{x_a(1),\dots,x_a(k)\}$, we compute the Cartesian product of the minibatch with the set of $l$ noise magnitudes we use in the curriculum (including $\epsilon=0$), producing one ASAC per (image, noise magnitude) pair. For each of the samples in the resulting minibatch of size $k\times l$, we compute a difficulty score, $DS(x'_a)$, detailed in Equation 4, that scores an ASAC based on the difficulty of the example for the target classifier to classify correctly. We then sort the minibatch based on the difficulty of each sample. See Table 4 for an ablation on the effect of a randomized, increasing, and decreasing order (w.r.t. difficulty score) for samples in the curriculum. This procedure is illustrated in detail in Algorithm 1 and defines the in-batch curriculum for each minibatch in the fine-tuning process of the target classifier $M(\theta,\rho)$.
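The in-batch curriculum described above (and in Algorithm 1) can be sketched as follows. This is a hedged illustration, not the authors' code: `attack_fn` and `score_fn` are hypothetical callables standing in for $f_{attack}$ and the difficulty score $DS$, labels are carried along with each ASAC for later fine-tuning, and the three values accepted by `order` are assumed from the ablation the text mentions.

```python
import random


def construct_curriculum(images, labels, epsilons, attack_fn, score_fn,
                         order="increasing", seed=None):
    """Build the k x l minibatch of ASACs and arrange it by difficulty.

    attack_fn(x, y, eps) -> perturbed image (hypothetical signature)
    score_fn(x_adv, y)   -> difficulty score in the sense of Eq. 4
    """
    if order not in ("increasing", "decreasing", "random"):
        raise ValueError(f"unknown order: {order}")

    scored = []
    for x, y in zip(images, labels):      # k images
        for eps in epsilons:              # l noise magnitudes, including 0
            x_adv = attack_fn(x, y, eps)
            scored.append((x_adv, y, score_fn(x_adv, y)))

    if order == "random":
        random.Random(seed).shuffle(scored)
    else:
        scored.sort(key=lambda item: item[2],
                    reverse=(order == "decreasing"))
    return scored
```

Passing the attack and the scorer as callables keeps the sketch independent of any particular attack or model, which mirrors the algorithm's requirement of "any scalable adversarial attack".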

3.3 Fine-Grained Control using Adversarial Loss

It is important to note that in all curricula experimented with, we include the original minibatch, i.e., $\epsilon=0$. When performing any kind of adversarial training, there exists the possibility of overfitting, especially when fine-tuning a model over the same or similar data it was originally trained on. Though overfitting may lead to improved fairness metrics, it results in a lower degree of generalization of the model. While fine-tuning the target classifier $M(\theta,\rho)$, we use Cross Entropy (CE) as our loss function ($J$); however, any fully differentiable loss function may be used. We extend our approach to address the issue of overfitting and to provide fine-grained control over the trade-off between accuracy and fairness by introducing an adversarial regularization coefficient, $\alpha$. We construct our loss function $\hat{J}$ from two terms, combined through a convex combination weighted by $\alpha$, as shown in the equation below:

$$\hat{J}_{\alpha}(\theta,\rho,x_a;x'_a,y) = \alpha J(\theta,\rho,x_a,y) + (1-\alpha) J(\theta,\rho,x'_a,y) \tag{5}$$
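As a minimal sketch of Equation 5 for a single example, the snippet below combines the cross-entropy on the clean image and on its ASAC. The function names and the choice to pass already-computed class probabilities are assumptions for illustration; a real training loop would compute both terms from the target classifier's outputs and average over the minibatch.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy J for one example, given predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def asac_loss(probs_clean, probs_adv, label, alpha=0.5):
    """J_hat = alpha * J(clean) + (1 - alpha) * J(ASAC), as in Eq. 5.

    probs_clean: target-classifier probabilities for the image x_a
    probs_adv:   target-classifier probabilities for the ASAC x'_a
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    return (alpha * cross_entropy(probs_clean, label)
            + (1.0 - alpha) * cross_entropy(probs_adv, label))
```

Setting `alpha=1.0` recovers ordinary fine-tuning on clean images, and `alpha=0.0` trains on ASACs only; intermediate values trade the two terms off.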

4 Results

In this section, we present and discuss various experiments to evaluate and interpret our proposed approach. We evaluate our approach through quantitative performance metrics and fairness evaluations, followed by qualitative analyses using visualization methods for model understanding such as Integrated Gradients (IG) [50]. In addition, we conduct ablations on different curriculum policies, noise magnitudes ($\epsilon$) and network architectures. We use two popular datasets, CelebA [33] and Labeled Faces in the Wild (LFW) [20]. To evaluate the fairness of our model, we employ three key metrics: the Difference in Demographic Parity (DDP), the Difference in Equalized Odds (DEO), and the Difference in Equalized Opportunity (DEOp) [16]. Alongside these, we also measure the model’s accuracy (ACC). Lower values for fairness metrics indicate a less biased model whereas a high value in accuracy indicates a more performant model. For more details about the metrics and their definition please refer to the supplementary material (Section A.1).
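For readers unfamiliar with these metrics, the sketch below computes them for binary predictions and a binary protected attribute. The exact definitions used in the paper are given in its Section A.1, which is not reproduced here, so the formulas in this block are assumptions based on common conventions: DDP as the gap in positive prediction rates, DEOp as the gap in true positive rates, and DEO as the sum of the true positive and false positive rate gaps (some works use the maximum instead).

```python
def positive_rate(y_pred, mask):
    """Fraction of positive predictions among the selected samples."""
    selected = [p for p, m in zip(y_pred, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


def fairness_gaps(y_true, y_pred, group):
    """Fairness gaps between protected groups 0 and 1 (assumed definitions)."""
    def rate(g, y=None):
        # Positive prediction rate in group g, optionally restricted to
        # samples whose true label equals y.
        mask = [a == g and (y is None or t == y)
                for a, t in zip(group, y_true)]
        return positive_rate(y_pred, mask)

    ddp = abs(rate(0) - rate(1))            # demographic parity gap
    deop = abs(rate(0, 1) - rate(1, 1))     # true positive rate gap
    fpr_gap = abs(rate(0, 0) - rate(1, 0))  # false positive rate gap
    return {"DDP": ddp, "DEOp": deop, "DEO": deop + fpr_gap}
```

All three quantities are 0 for a classifier whose behavior is identical across the two groups, which is consistent with the statement above that lower values indicate a less biased model.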

4.1 Quantitative Results

Table 1: A comparison of our approach with other approaches [63, 45] on the CelebA dataset. We also test on a variety of backbone architectures.
Label | Method | DEO | DEOp | DDP | ACC
Smile | Base Classifier | 0.088 | 0.066 | 0.15 | 84.29
Smile | Adversarial Training [63] | 0.077 | 0.051 | 0.17 | 91.74
Smile | Counterfactual GAN based training [45] | 0.065 | 0.050 | 0.17 | 86.50
Smile | ASAC (with FGSM) | 0.050 | 0.043 | 0.15 | 91.91
Smile | ASAC (with PGD) | 0.058 | 0.045 | 0.14 | 91.20
Big Nose | Base Classifier | 0.332 | 0.257 | 0.28 | 80.77
Big Nose | Adversarial Training [63] | 0.396 | 0.311 | 0.35 | 81.53
Big Nose | Counterfactual GAN based training [45] | 0.392 | 0.301 | 0.36 | 78.77
Big Nose | ASAC (with FGSM) | 0.354 | 0.267 | 0.30 | 82.05
Big Nose | ASAC (with PGD) | 0.374 | 0.281 | 0.31 | 81.03
Wavy Hair | Base Classifier | 0.347 | 0.237 | 0.35 | 81.15
Wavy Hair | Adversarial Training [63] | 0.359 | 0.264 | 0.40 | 82.07
Wavy Hair | Counterfactual GAN based training [45] | 0.344 | 0.230 | 0.34 | 79.21
Wavy Hair | ASAC (with FGSM) | 0.34 | 0.29 | 0.342 | 82.01
Wavy Hair | ASAC (with PGD) | 0.383 | 0.243 | 0.34 | 81.99

Table 2: Performance across accuracy and fairness metrics on the LFW dataset.
Label | Method | DEO | DEOp | DDP | ACC
Smile | Base Classifier | 0.074 | 0.071 | 0.19 | 71.29
Smile | Adversarial Training [63] | 0.065 | 0.059 | 0.18 | 84.74
Smile | ASAC (with FGSM) | 0.054 | 0.050 | 0.16 | 89.19
Smile | ASAC (with PGD) | 0.063 | 0.060 | 0.16 | 88.41
Bags Under Eyes | Base Classifier | 0.074 | 0.073 | 0.16 | 69.29
Bags Under Eyes | Adversarial Training [63] | 0.064 | 0.061 | 0.14 | 74.27
Bags Under Eyes | ASAC (with FGSM) | 0.061 | 0.060 | 0.14 | 87.29
Bags Under Eyes | ASAC (with PGD) | 0.066 | 0.062 | 0.17 | 86.42
Wavy Hair | Base Classifier | 0.082 | 0.072 | 0.18 | 63.29
Wavy Hair | Adversarial Training [63] | 0.065 | 0.064 | 0.19 | 73.27
Wavy Hair | ASAC (with FGSM) | 0.066 | 0.062 | 0.19 | 88.29
Wavy Hair | ASAC (with PGD) | 0.065 | 0.067 | 0.16 | 88.70

Table 3: Comparison of accuracy and fairness metrics between base classifier and fine-tuning using the proposed approach across different backbone architectures.
Model | Backbone Network | DEO | DEOp | DDP | ACC
Base Classifier | ResNet-18 | 0.088 | 0.066 | 0.150 | 84.29
Base Classifier | ResNet-50 | 0.077 | 0.064 | 0.148 | 86.60
Base Classifier | DenseNet-121 | 0.090 | 0.075 | 0.146 | 80.36
Base Classifier | DenseNet-169 | 0.092 | 0.075 | 0.139 | 80.95
ASAC | ResNet-18 | 0.050 | 0.043 | 0.150 | 91.91
ASAC | ResNet-50 | 0.047 | 0.042 | 0.137 | 90.78
ASAC | DenseNet-121 | 0.052 | 0.046 | 0.153 | 91.32
ASAC | DenseNet-169 | 0.070 | 0.051 | 0.158 | 91.08

Comparing the proposed method across different target attributes on different datasets

As shown in Table 1, we quantitatively evaluate the performance and fairness of our approach on the CelebA [33] dataset, focusing on target attributes such as Smile, Big Nose, and Wavy Hair, with gender as the protected attribute. Using ResNet-18 [17] as the backbone, we compare models trained with and without our proposed approach and evaluate against the original baseline model and two different methods: an adversarial training approach [63] and a debiasing approach by Ramaswamy et al. [45] that generates counterfactual images using GANs. Our approach, utilizing FGSM [14] and PGD [35] adversarial methods (with $\epsilon=\{0,0.01,0.001\}$) and $\alpha=0.5$, generally outperforms these previous methods in terms of both accuracy and fairness metrics. While the GAN-based counterfactual approach [45] appears to reduce the overall accuracy of the system post training, our method not only improves accuracy for all three target attributes, but also improves fairness metrics (for all target attributes except big nose). We conduct the same analysis on the LFW dataset [20], targeting the attributes Smiling, Bags Under Eyes, and Wavy Hair, with gender as the protected attribute. As shown in Table 2, our findings on the LFW dataset [20] are consistent with those on the CelebA dataset [33], indicating an increase in overall accuracy and improvements across most fairness metrics.

Comparing the proposed approach across different backbones

Table 3 presents a quantitative comparison of our approach using different backbones (ResNet-18 [17], ResNet-50 [17], DenseNet-121 [19], and DenseNet-169 [19]) on the CelebA dataset [33]. Similar to our evaluation on the CelebA and LFW datasets, we use the FGSM attack with $\epsilon=\{0,0.01,0.001\}$ to construct the ASACs in each minibatch and $\alpha=0.5$ in the loss function (Equation 5). Our method improves accuracy for every backbone while also reducing DEO and DEOp in each case, showing that our method generalizes across architectures and scale of the model.

4.2 Qualitative Results

4.2.1 Robustness evaluation of adversarial training

To assess the impact of adversarial training using ASACs, we examine how adversarial images designed to attack a gender classifier affect the decision of the smile classifier, before and after training with our proposed approach. Given an image, we add varying magnitudes of adversarial noise and provide the corrupted images to the smile classifier. Before applying our method, the adversarial noise intended to mislead the gender classifier also misled the smile classifier, indicating a possible correlation between the gender and smile attributes in these examples. Such an outcome suggests that manipulations targeting gender can influence predictions of the target attribute, e.g., smile. We perform an identical analysis on the models after our proposed fine-tuning. After fine-tuning, the target classifier improves its discriminative ability while the protected attribute classifier used to construct ASACs continues to be deceived. This indicates that the fine-tuning procedure helps to decouple spurious correlations between the target and protected attributes, enabling the target classifier to make more accurate predictions independent of the protected attribute. Qualitative examples of these results have been moved to the Supplementary section (Figure 4) due to space constraints.
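The probe described above can be illustrated with a minimal sketch. It is not the paper's pipeline: the classifiers are hypothetical three-feature logistic models (`protected`, `entangled`, `decoupled`), the weights are made up, and the noise magnitudes are far larger than the $\epsilon$ values used in the experiments so that the effect is visible on a toy input. It applies a single FGSM step against the stand-in protected-attribute classifier and reports how two target classifiers respond.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Positive-class probability of a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(weights, bias, x, label, eps):
    """Single FGSM step: x + eps * sign(dLoss/dx) for binary cross-entropy.

    For a logistic model, dLoss/dx_i = (p - label) * w_i.
    """
    p = predict(weights, bias, x)
    return [xi + eps * sign((p - label) * w) for xi, w in zip(x, weights)]

# Hypothetical (weights, bias) pairs over a 3-feature input.
protected = ([2.0, -1.0, 0.0], 0.0)   # stand-in protected-attribute classifier
entangled = ([1.5, -1.0, 0.5], 0.0)   # target classifier that reuses its features
decoupled = ([0.0, 0.0, 2.0], 0.0)    # target classifier that ignores them

x, protected_label = [0.3, 0.1, 0.4], 1

# Columns: eps, protected score, entangled target score, decoupled target score.
for eps in (0.0, 0.1, 0.3):
    x_adv = fgsm(*protected, x, protected_label, eps)
    print(eps,
          round(predict(*protected, x_adv), 3),
          round(predict(*entangled, x_adv), 3),
          round(predict(*decoupled, x_adv), 3))
```

In this toy setting, the largest perturbation flips the decision of the entangled target classifier together with that of the protected classifier, while the decoupled target classifier is unaffected. This mirrors the before and after behaviour reported above, under the stated simplifications.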

Integrated Gradients

To assess the impact of ASAC-driven adversarial and curriculum-based fine-tuning, we observe the effect our approach has on individual samples in the test set using Integrated Gradients (IG) [50]. Integrated Gradients is a widely used interpretability technique that highlights the pixels influencing the decision-making process of the model. In Figure 3, we juxtapose the integrated gradients of decisions made by the classifier before and after ASAC training. For more examples, see Section 6 in the supplementary material. After applying our proposed approach, the integrated gradients converge towards the mouth area in the images, which a human observer would also consider the strongest indicator of whether a person is smiling. Our observations suggest that ASAC-driven training compels the model to focus on regions of the image more closely correlated with the target attribute, smile in this case. Consequently, these findings indicate that the model fine-tuned using ASACs may give more importance to regions closer to the mouth than the original classifier (before fine-tuning).
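For readers unfamiliar with the attribution method, the following is a minimal sketch of Integrated Gradients on a toy model, not the implementation used for Figure 3. It assumes a hypothetical three-feature logistic model whose third feature stands in for the mouth region, an all-zero baseline, and a midpoint Riemann sum with 200 steps to approximate the path integral.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(weights, x):
    """Toy differentiable classifier: sigmoid(w . x)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def model_grad(weights, x):
    """Gradient of sigmoid(w . x) with respect to x."""
    p = model(weights, x)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(weights, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps
        # Point on the straight-line path from the baseline to the input.
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(weights, point)):
            total[i] += g
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, total)]

weights = [0.1, 0.2, 3.0]    # hypothetical model dominated by feature 3
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]   # assumed all-zero baseline

attributions = integrated_gradients(weights, x, baseline)
print([round(a, 3) for a in attributions])
```

The attributions sum (up to discretization error) to the difference between the model output at the input and at the baseline, which is the completeness property of Integrated Gradients, and the feature with the largest weight receives the largest attribution.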

Figure 3: Integrated Gradients for samples before (IG before) and after (IG after) training.

4.3 Ablation Study

In our ablation study, we use a ResNet-18 backbone with smile as the target attribute and gender as the protected attribute. We use FGSM with $\epsilon = \{0, 0.01, 0.001\}$ to generate the ASACs for each minibatch on the CelebA dataset and $\alpha = 0.5$ in the loss function (Equation 5).

Table 4: A comparison of different curriculum learning strategies.

| Method | DEO | DEOp | DDP | ACC |
|---|---|---|---|---|
| W/O Curriculum Learning | 0.055 | 0.044 | 0.14 | 91.23 |
| Curriculum Learning in Ascending order | 0.050 | 0.043 | 0.15 | 91.79 |
| Curriculum Learning in Descending order | 0.058 | 0.048 | 0.17 | 92.08 |

Table 5: Ablation on the effect of different values of noise magnitudes ($\epsilon$) on the performance and fairness metrics of the target classifier.

| Parameter | Setting | DEO | DEOp | DDP | ACC |
|---|---|---|---|---|---|
| $\epsilon$ | {0.001, 0.01} | 0.050 | 0.043 | 0.15 | 91.91 |
| $\epsilon$ | {0.001, 0.03} | 0.055 | 0.043 | 0.14 | 91.42 |
| $\epsilon$ | {0.001, 0.05} | 0.057 | 0.047 | 0.14 | 91.38 |

Evaluating the impact of Curriculum Learning

In this study, we evaluate the model's performance under three training strategies: no curriculum, where the order of ASACs in a minibatch is random; an ascending curriculum, where the difficulty (the confidence of the model in the misclassification) gradually increases; and a descending curriculum, which starts with the most difficult examples so that the difficulty decreases. The results (Table 4) show that beginning with challenging examples yields the highest accuracy (92.08), whereas saving them for later yields the lowest DEO (0.050) and DEOp (0.043). This highlights the importance of curriculum design in balancing accuracy and fairness.
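The three orderings compared in Table 4 can be sketched as follows. This is an illustrative helper, not the authors' training code: the function name `order_asacs`, the `(example_id, difficulty)` representation, and the difficulty scores are assumptions, with difficulty taken to be the model's confidence in the misclassified label.

```python
import random

def order_asacs(examples, strategy, seed=0):
    """Order (example_id, difficulty) pairs within a minibatch.

    difficulty is assumed to be the model's confidence in the wrong
    class, so a higher value means a harder example.
    """
    if strategy == "random":        # no curriculum
        shuffled = list(examples)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if strategy == "ascending":     # easy examples first
        return sorted(examples, key=lambda e: e[1])
    if strategy == "descending":    # hard examples first
        return sorted(examples, key=lambda e: e[1], reverse=True)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical minibatch of ASACs with made-up difficulty scores.
batch = [("a", 0.62), ("b", 0.91), ("c", 0.55), ("d", 0.78)]

print(order_asacs(batch, "ascending"))
print(order_asacs(batch, "descending"))
```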

Finding the most optimal noise magnitudes

We examine the effect of varying noise magnitudes on the performance of our smile classifier (Table 5). This study provides insights into selecting noise magnitudes to enhance model robustness and effectiveness. We find that the settings that include $\epsilon = 0.03$ and $\epsilon = 0.05$ yield the lowest DDP (0.14), whereas the setting $\{0.001, 0.01\}$ yields the lowest DEO (0.050) and the highest accuracy (91.91). Further experiments are detailed in the supplementary material (Section 6).

5 Discussion and Conclusion

We proposed a method for utilizing attribute-specific adversarial examples as counterfactuals to enhance fairness in computer vision models. Our method fine-tunes models with these counterfactuals using a novel curriculum learning training regime. Our experiments on CelebA and LFW datasets demonstrate that adversarial fine-tuning can significantly improve fairness metrics without sacrificing accuracy. Qualitative results indicate the model’s potential ability to disentangle predictions from protected attributes. Ablation studies justify different design choices, notably revealing that starting with challenging examples may lead to a more accurate model, while presenting easier examples earlier results in a fairer but less accurate model. This underscores the impact of curriculum learning on the problem. For our future work, we plan to investigate the impact of various adversarial attacks on the model. Additionally, we aim to explore how adversarial images can be utilized to interpret the relationship between model decisions and the underlying classifiers.

Ethical Considerations.

Our work targets the ethical concerns surrounding biases in computer vision models. Further, our approach can be used to mitigate biases by fine-tuning existing biased models and can also be deployed in a low-data setting. The main limitations of our work are that (1) it requires access to the backbone of the model and cannot be applied to a black-box model, (2) it relies on knowledge of the protected features, which is sometimes inaccessible, and (3) we only assess gender as a sensitive attribute. The rationale behind this experimental choice is detailed in the Supplementary material B.3.

6 Acknowledgements

This work was supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, the Sloan Foundation Grant 2020-13941, and the Simons Foundation investigators award 689988. The authors would like to thank these organizations for their generous support, which has been instrumental in advancing our research.

References

  • [1] E. Abbasnejad, D. Teney, A. Parvaneh, J. Shi, and A. v. d. Hengel. Counterfactual vision and language learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10044–10054, 2020.
  • [2] A. Abid, M. Yuksekgonul, and J. Zou. Meaningfully debugging model mistakes using conceptual counterfactual explanations. In International Conference on Machine Learning, pages 66–88. PMLR, 2022.
  • [3] G. Balakrishnan, Y. Xiong, W. Xia, and P. Perona. Towards causal benchmarking of bias in face analysis algorithms. In Deep Learning-Based Face Analytics, pages 327–359. Springer, 2021.
  • [4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • [5] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075, 2017.
  • [6] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29, 2016.
  • [7] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.
  • [8] Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richardson. Disco: distilling counterfactuals with large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5514–5528, 2023.
  • [9] J. Cheong, S. Kalkan, and H. Gunes. Counterfactual fairness for facial expression recognition. In European Conference on Computer Vision, pages 245–261. Springer, 2022.
  • [10] A. Chinchure, P. Shukla, G. Bhatt, K. Salij, K. Hosanagar, L. Sigal, and M. Turk. Tibet: Identifying and evaluating biases in text-to-image generative models. arXiv preprint arXiv:2312.01261, 2023.
  • [11] S. Dash, V. N. Balasubramanian, and A. Sharma. Evaluating and mitigating bias in image classifiers: A causal perspective using counterfactuals. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 915–924, 2022.
  • [12] E. Denton, B. Hutchinson, M. Mitchell, T. Gebru, and A. Zaldivar. Image counterfactual sensitivity analysis for detecting unintended bias. arXiv preprint arXiv:1906.06439, 2019.
  • [13] A. Feder, N. Oved, U. Shalit, and R. Reichart. Causalm: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2):333–386, 2021.
  • [14] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [15] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pages 1311–1320. PMLR, 2017.
  • [16] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29, 2016.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [18] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European conference on computer vision (ECCV), pages 771–787, 2018.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [20] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
  • [21] L. Jiang, X. Ma, S. Chen, J. Bailey, and Y.-G. Jiang. Black-box adversarial attacks on video recognition models. In Proceedings of the 27th ACM International Conference on Multimedia, pages 864–872, 2019.
  • [22] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. Hauptmann. Self-paced curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
  • [23] A. R. Joshi, X. S. Cuadros, N. Sivakumar, L. Zappella, and N. Apostoloff. Fair sa: Sensitivity analysis for fairness in face recognition. In Algorithmic fairness through the lens of causality and robustness workshop, pages 40–58. PMLR, 2022.
  • [24] A.-H. Karimi, G. Barthe, B. Balle, and I. Valera. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics, pages 895–905. PMLR, 2020.
  • [25] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • [26] A. Kasirzadeh and A. Smart. The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 228–236, 2021.
  • [27] D. Kaushik, E. Hovy, and Z. C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. arXiv preprint arXiv:1909.12434, 2019.
  • [28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [29] Y. Kong, L. Liu, J. Wang, and D. Tao. Adaptive curriculum learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5067–5076, 2021.
  • [30] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017.
  • [31] J. Lim, Y. Kim, B. Kim, C. Ahn, J. Shin, E. Yang, and S. Han. Biasadv: Bias-adversarial augmentation for model debiasing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3832–3841, 2023.
  • [32] B. Liu, W. Deng, Y. Zhong, M. Wang, J. Hu, X. Tao, and Y. Huang. Fair loss: Margin-aware reinforcement learning for deep face recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10052–10061, 2019.
  • [33] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [34] P. K. Lohia, K. N. Ramamurthy, M. Bhide, D. Saha, K. R. Varshney, and R. Puri. Bias mitigation post-processing for individual and group fairness. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2847–2851. IEEE, 2019.
  • [35] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • [36] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman. Teacher–student curriculum learning. IEEE transactions on neural networks and learning systems, 31(9):3732–3740, 2019.
  • [37] R. H. Maudslay, H. Gonen, R. Cotterell, and S. Teufel. It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5267–5275, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
  • [38] N. Meister, D. Zhao, A. Wang, V. V. Ramaswamy, R. Fong, and O. Russakovsky. Gender artifacts in visual datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4837–4848, 2023.
  • [39] R. K. Mothilal, A. Sharma, and C. Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 conference on fairness, accountability, and transparency, pages 607–617, 2020.
  • [40] N. Narodytska and S. Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1318, 2017.
  • [41] M. Pawelczyk, C. Agarwal, S. Joshi, S. Upadhyay, and H. Lakkaraju. Exploring counterfactual explanations through the lens of adversarial examples: A theoretical and empirical analysis. In International Conference on Artificial Intelligence and Statistics, pages 4574–4594. PMLR, 2022.
  • [42] H. Qiu, C. Xiao, L. Yang, X. Yan, H. Lee, and B. Li. Semanticadv: Generating adversarial examples via attribute-conditioned image editing. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 19–37. Springer, 2020.
  • [43] N. Quadrianto, V. Sharmanska, and O. Thomas. Discovering fair representations in the data domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8227–8236, 2019.
  • [44] A. Rahmati, S.-M. Moosavi-Dezfooli, P. Frossard, and H. Dai. Geoda: a geometric framework for black-box adversarial attacks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8446–8455, 2020.
  • [45] V. V. Ramaswamy, S. S. Kim, and O. Russakovsky. Fair attribute classification through latent space de-biasing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9301–9310, 2021.
  • [46] Y. Roh, K. Lee, S. Whang, and C. Suh. Fr-train: A mutual information-based approach to fair and robust training. In International Conference on Machine Learning, pages 8147–8157. PMLR, 2020.
  • [47] L. Seyyed-Kalantari, H. Zhang, M. B. McDermott, I. Y. Chen, and M. Ghassemi. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature medicine, 27(12):2176–2182, 2021.
  • [48] Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain. Towards universal representation learning for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6817–6826, 2020.
  • [49] K. Sokol and P. A. Flach. Counterfactual explanations of machine learning predictions: Opportunities and challenges for ai safety. SafeAI@ AAAI, pages 1–4, 2019.
  • [50] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319–3328. PMLR, 2017.
  • [51] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [52] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the conference on fairness, accountability, and transparency, pages 10–19, 2019.
  • [53] A. Van Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 650–665. Springer, 2021.
  • [54] S. Vandenhende, D. Mahajan, F. Radenovic, and D. Ghadiyaram. Making heads or tails: Towards semantically consistent visual counterfactuals. In European Conference on Computer Vision, pages 261–279. Springer, 2022.
  • [55] A. Wang, A. Liu, R. Zhang, A. Kleiman, L. Kim, D. Zhao, I. Shirai, A. Narayanan, and O. Russakovsky. Revise: A tool for measuring and mitigating bias in visual datasets. International Journal of Computer Vision, 130(7):1790–1810, 2022.
  • [56] A. Wang and O. Russakovsky. Overwriting pretrained bias with finetuning data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3957–3968, 2023.
  • [57] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020.
  • [58] J. Wang. Adversarial examples in physical world. In IJCAI, pages 4925–4926, 2021.
  • [59] Z. Wang, X. Dong, H. Xue, Z. Zhang, W. Chiu, T. Wei, and K. Ren. Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10379–10388, 2022.
  • [60] Z. Wang, K. Qinami, I. C. Karakozis, K. Genova, P. Nair, K. Hata, and O. Russakovsky. Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8919–8928, 2020.
  • [61] T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, 2021.
  • [62] P. Zeng, Y. Li, P. Hu, D. Peng, J. Lv, and X. Peng. Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23986–23995, 2023.
  • [63] Y. Zhang and J. Sang. Towards accuracy-fairness paradox: Adversarial example-based data augmentation for visual debiasing. In Proceedings of the 28th ACM International Conference on Multimedia, pages 4346–4354, 2020.

Appendix A Supplementary Material

A.1 Fairness Definitions

To evaluate the fairness of our model, we employ three key metrics: the Difference in Demographic Parity (DDP), the Difference in Equalized Odds (DEO), and the Difference in Equalized Opportunity (DEOp). Alongside these, we also measure the model’s accuracy (ACC).

The DEO focuses on the difference in the rates of false negatives and false positives between genders. A substantial DEO indicates a bias in which one group is either less likely to be incorrectly dismissed or more likely to be inaccurately favored.

Definition 1 (Equalized Odds).

A predictor $\hat{Y}$ satisfies equalized odds with respect to protected attribute $A \in \{a, a'\}$ and outcome $Y$, if $\hat{Y}$ and $A$ are independent conditional on $Y$.

$$P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=a') \quad \text{where } y \in \{0,1\}. \tag{6}$$

To estimate how far a predictor $\hat{Y}$ is from having equalized odds, we estimate the Difference in Equalized Odds (DEO), i.e., we estimate

$$\frac{\sum_{y \in \{0,1\}} \left| P(\hat{Y}=1 \mid Y=y, A=a) - P(\hat{Y}=1 \mid Y=y, A=a') \right|}{2}.$$

DDP quantifies the absolute gap in positive prediction rates (e.g., smile prediction) between the groups defined by the protected attribute. A large DDP value suggests a tendency for individuals in a specific group to receive more favorable outcomes than their counterparts in the other group.

Definition 2 (Demographic Parity).

A predictor $\hat{Y}$ satisfies demographic parity with respect to protected attribute $A \in \{a, a'\}$ and outcome $Y$, if the following condition holds:

$$P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=a'). \tag{7}$$

To estimate how far a predictor $\hat{Y}$ is from having demographic parity, we estimate the Difference in Demographic Parity (DDP), i.e., we estimate

$$\left| P(\hat{Y}=1 \mid A=a) - P(\hat{Y}=1 \mid A=a') \right|.$$

The Difference in Equalized Opportunity (DEOp) measures the disparity in true positive rates between demographic groups, indicating bias in favorable outcomes. A DEOp of zero signifies equal true positive rates across groups, representing a fairness ideal in model performance. The Difference in Equalized Opportunity is defined as follows.

Definition 3 (Equalized Opportunity).

A predictor $\hat{Y}$ satisfies equalized opportunity with respect to protected attribute $A \in \{a, a'\}$ and outcome $Y$, if the following condition holds for the positive outcome $Y = 1$:

$$P(\hat{Y}=1 \mid Y=1, A=a) = P(\hat{Y}=1 \mid Y=1, A=a').$$

To measure how far a predictor $\hat{Y}$ is from having equalized opportunity, we estimate the Difference in Equalized Opportunity (DEOp), i.e., we estimate

$$\left| P(\hat{Y}=1 \mid Y=1, A=a) - P(\hat{Y}=1 \mid Y=1, A=a') \right|.$$

Ideally, we want these values to be zero, indicating no disparity.
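As a worked illustration of the three definitions, the sketch below estimates DDP, DEO, and DEOp from empirical frequencies. It is a minimal example under stated assumptions rather than the evaluation code used in our experiments: the helper names `rate` and `fairness_gaps` are hypothetical, the two groups are encoded as 0 and 1, and the data are made up.

```python
def rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    if not selected:
        raise ValueError("empty subgroup")
    return sum(selected) / len(selected)

def fairness_gaps(y_true, y_pred, group):
    """Return (DDP, DEO, DEOp) for binary labels, predictions, and groups."""
    def cond_rate(a, y=None):
        # P(Y_hat = 1 | A = a) if y is None, else P(Y_hat = 1 | Y = y, A = a).
        mask = [g == a and (y is None or t == y) for g, t in zip(group, y_true)]
        return rate(y_pred, mask)

    ddp = abs(cond_rate(0) - cond_rate(1))
    gaps = {y: abs(cond_rate(0, y) - cond_rate(1, y)) for y in (0, 1)}
    deo = (gaps[0] + gaps[1]) / 2   # average of the Y=0 and Y=1 gaps
    deop = gaps[1]                  # gap in true positive rates only
    return ddp, deo, deop

# Made-up data: four samples per group.
group  = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(fairness_gaps(y_true, y_pred, group))  # (0.25, 0.25, 0.5)
```

Here the positive prediction rates are 0.5 and 0.25, so DDP is 0.25. The true positive rates are 1.0 and 0.5 and the false positive rates are both 0, so DEOp is 0.5 and DEO is (0 + 0.5) / 2 = 0.25.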

Appendix B Additional Ablation Studies

Table 6: Ablation on the effect of different values of $\epsilon$ and $\alpha$ on the performance and fairness metrics of the target classifier utilizing FGSM for ASAC generation.

| Parameter | Setting | DEO | DEOp | DDP | ACC |
|---|---|---|---|---|---|
| $\epsilon$ | {0.001, 0.03, 0.05} | 0.065 | 0.047 | 0.16 | 91.00 |
| $\epsilon$ | {0.001} | 0.054 | 0.043 | 0.14 | 91.62 |
| $\epsilon$ | {0.05} | 0.055 | 0.044 | 0.14 | 91.71 |
| $\alpha$ | 0.3 | 0.06 | 0.04 | 0.16 | 92.04 |
| $\alpha$ | 0.5 | 0.058 | 0.050 | 0.16 | 91.84 |
| $\alpha$ | 0.7 | 0.057 | 0.047 | 0.14 | 91.90 |
Figure 4: Qualitative results showing that our trained model becomes robust to ASACs after training.

Figure 5: Examples where the adversarial noise is unable to flip the smile classifier. As shown in the figure, adding adversarial noise does not change the decision of the smile classifier (red curve). After training, the decision remains unchanged as well.

Figure 6: More examples showcasing the change in integrated gradients for the smile classifier before and after applying our method.

Choosing an Optimal $\alpha$

In our third ablation study (Table 6), we examine the impact of different $\alpha$ values (0.3, 0.5, and 0.7) on model performance. The lowest value, $\alpha = 0.3$, gives the highest accuracy (92.04), whereas $\alpha = 0.5$ prioritizes fairness metrics. We select $\alpha = 0.5$ for final model training to balance accuracy and fairness. These experiments are also conducted on a base ResNet-18 smile classifier trained on CelebA data.

Finding optimal noise magnitudes

Additional results for different combinations of noise magnitudes are shown in Table 6.

B.1 Additional Robustness Results

This section shows additional positive (Figure 4) and negative (Figure 5) results, similar to the qualitative results shown in Section 4.2. Note that some samples are not affected by the adversarial noise, so the outputs of the smile classifier do not change. These examples are shown in Figure 5.

B.2 Additional IG Results

In Figure 6, we show more examples of how Integrated Gradients (IG) change on sample images after applying our method. Integrated Gradients indicate which parts of the image most affect the model's decisions. We follow the same procedure as described in Section 4.2.1. The examples in Figure 6 show a similar trend to Figure 3: after applying our approach, the pixels used in decision-making converge from the entire image towards the mouth region.

B.3 Comparison along other dimensions of bias

In this paper, we focus on gender as the primary dimension of bias. The primary reason for this choice is the availability of data and the clear presence of spurious correlations. For instance, in the CelebA dataset, both gender and smile attributes have nearly uniform distributions and show a positive Pearson correlation of 0.23. In contrast, other attributes like age in CelebA exhibit a highly skewed distribution (22% young and 78% old) and weak correlations with other uniformly distributed attributes (0.12 with smile, 0.11 with big nose, and 0.16 with wavy hair). Additionally, the CelebA dataset does not include any attributes related to race. Similarly, the LFW dataset does not have a single label for race but multiple labels such as Indian, Black, and White, which makes it harder for us to draw a comparison. Consequently, our analysis is limited to gender bias.
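The correlations quoted above are Pearson coefficients between binary attribute labels. As a minimal sketch of how such a value is computed, the snippet below uses a hypothetical helper `pearson` and two made-up binary attribute vectors; it does not reproduce the CelebA statistics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up binary labels for eight images (1 = attribute present).
protected_attr = [1, 1, 1, 0, 0, 0, 1, 0]
target_attr    = [1, 1, 0, 0, 0, 1, 1, 0]

print(pearson(protected_attr, target_attr))  # 0.5
```

Both toy attributes are balanced (four positives out of eight) and agree on six of the eight samples, which gives a coefficient of 0.5.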

B.4 Training setup

For our experiments, we used a training cluster with a single Nvidia RTX A6000 GPU with 48 GB of GDDR6 memory, 384 GB of RAM, and 20 CPU cores. Each training or fine-tuning run took between 60 and 120 minutes. In addition, for experiments involving ResNet-18, we used an Apple M1 Pro chip with an 8-core CPU, 14-core GPU, 16-core Neural Engine, and 16 GB of RAM. For experiments of the same scale, we also used an Nvidia A100 with 40 GB of memory and 25 GB of RAM. Unless otherwise mentioned, we train and fine-tune models using a ResNet-18 backbone, with $\alpha = 0.5$ and a noise magnitude setting of $\{0.01, 0.001\}$. We train the base classifiers for 50 epochs and fine-tune using our curriculum learning approach for 10 epochs, with both stages employing a learning rate of $1 \times 10^{-3}$. Across all experiments we use the Adam [28] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$, and perform gradient clipping with a value of 1.0.