License: arXiv.org perpetual non-exclusive license
arXiv:2403.04547v1 [cs.LG] 07 Mar 2024

CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Ibrahim Alabdulmohsin
Google Deepmind
Zürich, Switzerland

Xiao Wang
Google Deepmind
Zürich, Switzerland

Andreas Steiner
Google Deepmind
Zürich, Switzerland

Priya Goyal
Google Deepmind
New York, USA

Alexander D’Amour
Google Deepmind
Boston, USA

Xiaohua Zhai
Google Deepmind
Zürich, Switzerland

Abstract

We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.

1 Introduction

Recent advances on multimodal systems have been breathtaking, including in zero-shot classification (Radford et al., 2021; Zhai et al., 2022b; Yu et al., 2022a; Jia et al., 2021), text-to-image generation (Saharia et al., 2022; Yu et al., 2022b; Chang et al., 2023; Rombach et al., 2022; Ramesh et al., 2022), image captioning (Alayrac et al., 2022; Chen et al., 2022) and music generation (Agostinelli et al., 2023) to name a few. However, such systems can inflict harm if left unchecked, such as by amplifying biases, causing performance disparities, or encoding narrow cultural perspectives.

Contrary to the traditional supervised learning setup, which is well-understood, multimodal systems present novel ethical challenges. First, they typically operate with an open vocabulary, in which the set of input and/or output tokens is unbounded. Hence, statistical definitions of bias such as demographic parity (Dwork et al., 2012; Zafar et al., 2017; Mehrabi et al., 2019) or equalized odds (Hardt et al., 2016; Kleinberg et al., 2016) do not extend easily to such systems and might even yield contradictory results under different setups (Akyürek et al., 2022). Second, externalization in multimodal systems – by allowing users to interact with the system in ways that can disclose its internal reasoning – along with the open-vocabulary nature can expose users to biases in unanticipated ways. Third, data in multimodal systems is potentially biased, perpetuating societal stereotypes (Birhane et al., 2021).

For concreteness, consider the following examples. If we query the text-to-image model Imagen (Saharia et al., 2022) with the prompt: “clipart picture of a manager dressed in black talking to a secretary dressed in blue,” we get the generated image samples shown in Figure 1 (left), in which managers are men and secretaries are women. Similarly, if we ask for pictures of pilots and flight attendants, we get the sample of pictures shown in Figure 1 (right). Similar examples are reported by Wang et al. (2023), Cho et al. (2022), Tan et al. (2020), and Mishkin et al. (2022) using GANs (Goodfellow et al., 2014), Stable Diffusion (Rombach et al., 2022) and DALL-E (Ramesh et al., 2021).

Moreover, zero-shot classifiers are also susceptible to biases as highlighted by Hall et al. (2023) and Birhane et al. (2021), which we demonstrate in Figure 1 using contrastive image-language pretraining (CLIP) (Radford et al., 2021). Evidently, CLIP encodes societal stereotypes, such as by associating ties, cars, and tools with men. Such unintended biases can inflict harm by shaping social narratives. In addition, due to their open vocabulary nature, users can apply such models in unanticipated ways, e.g. so-called “corner cases” (Birhane et al., 2021), to promote prejudices.

Figure 1: top: Text-to-image models prompted for occupations, such as manager / secretary (left) or pilot / flight attendant (right) can reflect societal stereotypes. Refer to Section 1 for the exact prompts. bottom: CLIP can encode societal stereotypes, such as by associating cars with men. See Section 4.

In this work, we do not claim to offer a comprehensive solution to the complex issues discussed above. Rather, we examine in depth the effectiveness of one remediation strategy: data balancing. Specifically, we investigate the impact of debiasing data in multimodal systems that align embeddings/representations across modalities in the contrastive learning setup.

We focus on contrastive learning for the following reasons. First, it captures many of the complexities involved in multimodal systems, including open vocabulary, externalization, lack of data distribution at inference time, and lack of well-established definitions of bias. Second, such systems exhibit strong societal biases (Hall et al., 2023; Birhane et al., 2021). Third, they are more amenable to analysis than generative models (e.g. when studied in the zero-shot visual classification setting or in cross-modal retrieval). Fourth, they are increasingly used in critical domains like healthcare (Sellergren et al., 2022; Tiu et al., 2022; Zhang et al., 2023), and are popular ingredients in many models, as seen in Florence (Yuan et al., 2021), CoCa (Yu et al., 2022a), and OWL-ViT (Minderer et al., 2022a). Finally, biases in such models can manifest downstream, e.g. in captioning and retrieval (Berg et al., 2022).

To enable this analysis, we develop a data balancing algorithm, called Multi-Modal Moment Matching (M4), and analyze it theoretically (see Section 5). Using M4, we conduct a comprehensive evaluation of the effectiveness of data balancing for obtaining desirable model behavior, identifying areas of strength and limitation, while accounting for factors such as the model, representation, and training data sizes. We also examine how quickly CLIP learns/unlearns biases, among other evaluations. In total, we train over 150 models. To the best of our knowledge, an empirical evaluation of this magnitude for data balancing in multimodal systems has never been conducted before (see Section 6). Broadly speaking, our results are nuanced, and suggest that while data balancing has a positive impact on the model bias with a mixed impact on its quality, it is insufficient for obtaining fair behavior downstream. This echoes prior observations in vision; e.g. Wang et al. (2019).

2 Preliminaries

For a brief overview, CLIP contains two towers for vision and language, which are encoder-only transformers (Vaswani et al., 2017). Denoting $\mathbf{x}=(\mathbf{v},\,\mathbf{t})$ for an image-text pair, let $(\mathbf{z}^{v},\,\mathbf{z}^{t})\in\mathbb{R}^{r}\times\mathbb{R}^{r}$ be the corresponding outputs of the two towers, each embedded in $\mathbb{R}^{r}$. We refer to $r$ as the “representation size” in this paper. Given a fixed image $\mathbf{v}$ and a collection of captions $(\mathbf{t}_{1},\ldots,\mathbf{t}_{n})$ whose corresponding representations are $(\mathbf{z}^{t}_{1},\ldots,\mathbf{z}^{t}_{n})\in\mathbb{R}^{r\times n}$, CLIP assigns a probability score to each caption by first calculating the logits $l=(\langle\mathbf{z}^{v},\mathbf{z}_{k}^{t}\rangle)_{k\in[n]}\in\mathbb{R}^{n}$ and then applying softmax normalization $p=\mathrm{SoftMax}(l/T)$, where $T\in\mathbb{R}^{+}$ is a learnable temperature.
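
The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function name `caption_probabilities` and the toy embeddings are assumptions, and the temperature is passed in as a fixed number rather than learned.

```python
import math


def caption_probabilities(z_v, z_t_list, temperature):
    """Softmax over the inner products <z_v, z_t_k> / T for one image and n captions."""
    logits = [sum(a * b for a, b in zip(z_v, z_t)) / temperature for z_t in z_t_list]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings with representation size r = 2 (made-up values).
image = [1.0, 0.0]
captions = [[1.0, 0.0], [0.0, 1.0]]
probs = caption_probabilities(image, captions, temperature=1.0)
```

Here the first caption is aligned with the image embedding, so it receives the larger probability; lowering the temperature would sharpen the distribution further.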

To study the effectiveness of data bias mitigation in CLIP, we introduce a data balancing algorithm that tackles biases in first-order statistics (such as perceived gender imbalances) and second-order statistics (such as correlating occupations with a particular perceived gender).

Definition 1 (Data Representation Bias).

If $\mathbf{s}\sim\mathcal{D}\in\{0,1\}^{m}$ is a sensitive attribute sampled i.i.d., the representation bias (RB) in $\mathcal{D}$ with respect to a target $\pi\in[0,1]^{m}$ is: $\max_{k\in[m]}|\pi_{k}-\mathbb{E}_{\mathcal{D}}[\mathbf{s}_{k}]|$.

To keep our analysis general, $\mathbf{s}\in\{0,\,1\}^{m}$ in Definition 1 is a binary vector that encodes all (potentially overlapping) sensitive attribute categories, such as perceived gender and age groups. For instance, the first entry in $\mathbf{s}$ could be a binary indicator for the group “perceived women,” the second a binary indicator for the group “perceived men,” the next ten indicators for the Monk Skin Tone scale (Monk, 2019), and so on. We keep $\pi$ in our discussion arbitrary because the desired target distribution of categories may vary depending on the context, as discussed in (Berg et al., 2022).

Data representation bias (RB) measures differences in group prevalence; e.g. if $\pi=(0.5,0.5)$ for “men” and “women,” then having only “men” in 80% of the images implies a significant RB in the data. Definition 1 defines RB w.r.t. $\pi$ to account for overlapping attributes, such as gender and race.
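
As a small worked example of Definition 1, the sketch below estimates RB from a finite sample. The function name and the toy dataset are assumptions made for illustration; in particular, it assumes the remaining 20% of images contain only “women.”

```python
def data_representation_bias(samples, target):
    """Empirical max_k |pi_k - mean(s_k)| over binary attribute vectors."""
    n = len(samples)
    m = len(target)
    means = [sum(s[k] for s in samples) / n for k in range(m)]
    return max(abs(target[k] - means[k]) for k in range(m))


# Hypothetical data: entries are (men, women) indicators.
# 80% of images contain only "men", 20% contain only "women".
samples = [(1, 0)] * 8 + [(0, 1)] * 2
rb = data_representation_bias(samples, target=(0.5, 0.5))
```

With these assumed numbers the empirical means are (0.8, 0.2), so the estimate is about 0.3 for a uniform target.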

Definition 2 (Data Association Bias).

If $(\mathbf{s},\,\mathbf{y})\in\mathcal{S}\times\mathcal{Y}$, where $\mathcal{S}=\{0,1\}^{m}$ and $\mathcal{Y}=\{0,1\}^{c}$, is sampled i.i.d. from a joint distribution $\mathcal{D}$, association bias (AB) in $\mathcal{D}$ w.r.t. $(\mathcal{S},\,\mathcal{Y})$ is defined as:

$$\mathrm{Data\;Association\;Bias}=\max_{k\in[m],\,r\in[c]}\big|\mathbb{E}_{\mathcal{D}}\left[\mathbf{y}_{r}\,|\,\mathbf{s}_{k}=1\right]-\mathbb{E}_{\mathcal{D}}\left[\mathbf{y}_{r}\,|\,\mathbf{s}_{k}=0\right]\big|. \tag{1}$$

Definition 2 is an extension of the widely-adopted notion of demographic parity (Dwork et al., 2012; Zafar et al., 2017; Alabdulmohsin & Lucic, 2021). It captures bias in second-order statistics. For example, if $\mathcal{Y}$ is the set of occupations and $\mathcal{S}$ is perceived gender, association bias is large when an occupation is more prevalent among “perceived men” compared to “perceived women.”
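
Equation (1) can be estimated from a finite sample as sketched below. The function name and the toy data are assumptions, and skipping attributes for which one of the conditional means is undefined is a choice made only for this sketch.

```python
def data_association_bias(s_rows, y_rows):
    """Empirical max over (k, r) of |E[y_r | s_k = 1] - E[y_r | s_k = 0]|."""
    m, c = len(s_rows[0]), len(y_rows[0])
    worst = 0.0
    for k in range(m):
        present = [y for s, y in zip(s_rows, y_rows) if s[k] == 1]
        absent = [y for s, y in zip(s_rows, y_rows) if s[k] == 0]
        if not present or not absent:
            continue  # a conditional mean is undefined; skipped in this sketch
        for r in range(c):
            mean_present = sum(y[r] for y in present) / len(present)
            mean_absent = sum(y[r] for y in absent) / len(absent)
            worst = max(worst, abs(mean_present - mean_absent))
    return worst


# Hypothetical data: one sensitive attribute and one occupation label.
s_rows = [(1,), (1,), (1,), (1,), (0,), (0,), (0,), (0,)]
y_rows = [(1,), (1,), (1,), (0,), (1,), (0,), (0,), (0,)]
ab = data_association_bias(s_rows, y_rows)
```

In this toy sample the occupation appears in 75% of examples with the attribute and 25% of examples without it, giving an association bias of 0.5.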

Both types of bias in Definitions 1 and 2 are defined w.r.t. the data distribution. Next, we provide the analogous definitions for the model itself, which is always CLIP throughout our analysis.

Definition 3 (Model Representation Bias).

If $f:\mathcal{X}\to\Delta^{m}$ is a classifier outputting a probability distribution over some sensitive attributes $\mathcal{S}$, the representation bias (RB) in $f$ w.r.t. a fixed target $\pi\in[0,1]^{m}$ is: $\max_{k\in[m]}|\pi_{k}-\mathbb{E}_{\mathcal{D}}[f_{k}(\mathbf{x})]|$, where $f_{k}$ is the probability assigned to attribute $k$.

In particular, if $\pi$ is uniform and the model assigns a higher probability to one group over others, on average, it has a larger representation bias. We care about RB because models should follow the principle of indifference (PI) (Eva, 2019) if images do not contain relevant evidence about subgroups.

Definition 4 (Model Association Bias).

If $(\mathbf{x},\mathbf{s},\mathbf{y})\in\mathcal{X}\times\mathcal{S}\times\mathcal{Y}\sim\mathcal{D}$ are drawn i.i.d., where the sensitive attribute $\mathbf{s}\in\mathcal{S}$ is as in Definition 2, the association bias (AB) in $f:\mathcal{X}\to\mathcal{Y}$ is:

$$\mathrm{Model\;Association\;Bias}=\max_{k\in[m],\,r\in[c]}\big|\mathbb{E}_{\mathcal{D}}\left[f_{r}(\mathbf{x})\,|\,\mathbf{s}_{k}=1\right]-\mathbb{E}_{\mathcal{D}}\left[f_{r}(\mathbf{x})\,|\,\mathbf{s}_{k}=0\right]\big|. \tag{2}$$

Our goal is to study how such biases in the data transfer to their corresponding biases in the model, and whether data balancing is impactful, especially in light of the bias amplification phenomenon often observed in the literature (Bolukbasi et al., 2016; Hendricks et al., 2018; Zhao et al., 2017). For consistency, we focus on perceived gender as a sensitive attribute $\mathbf{s}$ and use occupations as the set of labels $\mathcal{Y}$ in Definitions 2 and 4. We denote the original data (without intervention) as baseline.

Generally, to mitigate biases in the data, we explore two options. The first option is to tackle the desired constraints directly; i.e. by focusing only on perceived gender and occupations. We refer to this version of the data as balanced. The second option is to include proxies as well. For instance, CLIP may associate “pilots” with “perceived men” even if decorrelated in the data because cockpits resemble machines, and “machines” might be associated with “perceived men” elsewhere in the data. The set of proxies we use, such as “computer” and “briefcase,” is suggested by an LLM and listed in Appendix A.5. During training, we balance the marginal distribution of perceived gender and remove correlations with both occupation and proxies. We refer to this version of the data as proxies. In Appendix A.5.2, we motivate including proxies using the causal formalism in Veitch et al. (2021).

3 Summary of Findings

Before presenting the detailed experimental results, we first summarize our major findings:

  I. Proxies mitigate representation biases: In addition to balancing the prevalence of sensitive attributes, decorrelating sensitive attributes against many proxy variables helps minimize the model’s RB, thereby preventing the model from favoring certain subgroups in unrelated contexts.

  II. But proxies hurt association biases: For AB in the closed-vocabulary setting, e.g. removing correlation between gender and occupation, adding extra proxy variables can negatively affect attempts to reduce AB. This is likely because the added constraints compete with the original ones during data balancing when not all constraints can be satisfied simultaneously.

  III. Fine-tuning is effective for representation bias: RB in the model is sensitive to the last distribution seen on the training data. So, fine-tuning on balanced data effectively counters it.

  IV. But fine-tuning is less effective for association bias: The extent of AB in the model varies gradually based on the duration spent training on balanced data, irrespective of whether the balanced data is seen first or last during training.

  V. Mixed impact on model performance: Data balancing changes the distribution of human and non-human images/texts, which leads to improvement on visual classification benchmarks but worse performance on image-text retrieval. We explain this in Section 4.4 and Appendix A.6.

  VI. But improving data quality and model architecture helps: Improving data quality and model architecture, such as by filtering out image-text pairs with low similarity scores and using SigLIP (Zhai et al., 2023), seems to mitigate any potential negative impacts of data balancing on the model’s performance. For example, COCO image-to-text retrieval @5 improves from 86% without data balancing to 87%, and ImageNet 0-shot classification improves from 77% to 77.5%. See Section 4.4.

4 Detailed Results

Setup.

We assume a potentially infinite set of examples $(\mathbf{x}^{(i)},\mathbf{y}^{(i)},\mathbf{s}^{(i)})_{i\in\mathbb{N}}$, where $\mathbf{x}$ is an image-text pair, $\mathbf{y}\in\{0,\,1\}^{c}$ are some predefined labels, and $\mathbf{s}\in\{0,\,1\}^{m}$ encodes sensitive attribute categories. See Section 5 for further details. We focus on perceived gender $\mathbf{s}$ and use occupations as the set of labels $\mathcal{Y}$ (see Definitions 1 and 2). Both are inferred from image and text separately and then concatenated. For instance, to remove correlation between gender and occupation, we de-correlate all combinations of image- and text-based annotations. Since $\mathbf{s}$ and $\mathbf{y}$ are binary vectors, zero correlations imply independence so, by the data-processing inequality (Cover & Thomas, 1991), this offers a stronger bias guarantee than logical disjunction, where an attribute is marked present if detected in any modality. See Appendix A.1 for illustrations and additional discussions.
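
The two ways of combining per-modality annotations mentioned above can be contrasted with a small sketch. The helper names and the annotation values are hypothetical; this is not the paper's annotation pipeline.

```python
def concatenate(image_attrs, text_attrs):
    """Keep image-based and text-based indicators as separate entries."""
    return tuple(image_attrs) + tuple(text_attrs)


def disjunction(image_attrs, text_attrs):
    """Mark an attribute present if it is detected in either modality."""
    return tuple(int(a or b) for a, b in zip(image_attrs, text_attrs))


# One hypothetical example: the first attribute is detected in the image only.
image_attrs, text_attrs = (1, 0), (0, 0)
combined = concatenate(image_attrs, text_attrs)
merged = disjunction(image_attrs, text_attrs)

# The disjunction can be recomputed from the concatenation alone.
m = len(image_attrs)
recovered = tuple(int(a or b) for a, b in zip(combined[:m], combined[m:]))
```

Because the disjunction is a deterministic function of the concatenated vector, removing dependence on the concatenated annotations also removes it for the disjunction, which is the direction of the data-processing argument in the text.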

In order to explore the dynamic nature of how contrastive language-image pretraining (CLIP) learns and unlearns biases, we split training into two stages. In Stage 1, we train on either the original data without intervention or on the same dataset but after intervention. In Stage 2, we switch to the other version of the dataset. In total, each CLIP model is trained on 1B image-text pairs from the WebLI dataset (Chen et al., 2022). (Footnote 1: Because debiasing removes about 10% of the examples, models trained on debiased data see some examples twice. However, there is evidence that examples seen twice behave like fresh examples when training is not converged (Alabdulmohsin et al., 2022b), so we do not expect this difference to have an impact on the results.) We vary the length of Stage 1 in $\{0\%, 10\%, 90\%\}$. We use three random seeds for each setup. Appendix A.4 provides the full training configuration.
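
To make the two-stage split concrete, the sketch below converts the Stage 1 percentages into example counts for a 1B-example budget. The function name is hypothetical, and the sketch only does the arithmetic; it says nothing about how the actual training loop switches datasets.

```python
def stage_lengths(total_examples, stage1_percent):
    """Split a training budget into (Stage 1, Stage 2) example counts."""
    stage1 = total_examples * stage1_percent // 100
    return stage1, total_examples - stage1


TOTAL = 1_000_000_000  # 1B image-text pairs, as stated in the text
schedules = {percent: stage_lengths(TOTAL, percent) for percent in (0, 10, 90)}
```

For instance, a 10% Stage 1 corresponds to 100M examples on the first version of the data followed by 900M on the other.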

In order to study the architecture’s impact, we experiment with two model sizes studied in Dosovitskiy et al. (2020) and Zhai et al. (2022a): size S and size B with, respectively, 30M and 100M parameters for each modality. In addition, we vary the representation size $r\in\{384,768\}$ (see Section 2). Since we use a patch size of $32\times 32$ in our sweep, we also verify key findings on larger sequence lengths using ViT-B/16 with $16\times 16$ patch sizes. We conduct a meta-analysis on all configurations to determine statistical significance using Wilcoxon’s signed-rank test (Wilcoxon, 1992).
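
As a rough illustration of this kind of paired test, the sketch below computes an exact two-sided p-value for Wilcoxon’s signed-rank test by enumerating all sign assignments. It assumes no zero differences and no ties among the absolute differences, and the paired scores are hypothetical; it is not the paper's analysis code.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided p-value; assumes non-zero, untied absolute differences."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis each rank is positive or negative with equal probability.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, sign in zip(range(1, n + 1), signs) if sign)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Hypothetical paired scores for eight configurations; the first is always larger.
first = [10, 20, 30, 40, 50, 60, 70, 80]
second = [9, 18, 27, 36, 45, 54, 63, 72]
p_value = wilcoxon_signed_rank_exact(first, second)
```

With eight pairs that all differ in the same direction, only 2 of the 256 sign assignments are at least as extreme, so the p-value is 2/256 (about 0.008). The enumeration is exponential in the number of pairs, so this sketch is only suitable for small samples.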

4.1 Representation Bias

To evaluate RB in CLIP according to Definition 3, we use ILSVRC2012 (Deng et al., 2009) and compare the parity $\mathbb{E}[p(\mathrm{``man''})-p(\mathrm{``woman''})]$ across a random subset of 8K images. Intuitively, models should be indifferent to perceived gender in the absence of evidence, as discussed in Section 2.
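
The parity score can be computed as sketched below, assuming the per-image probabilities for the two labels have already been obtained from the model. The function name and the probability values are made up for illustration.

```python
def mean_parity(probabilities):
    """Average of p("man") - p("woman") over images; items are (p_man, p_woman)."""
    gaps = [p_man - p_woman for p_man, p_woman in probabilities]
    return sum(gaps) / len(gaps)


# Hypothetical per-image probabilities for three images.
scores = [(0.75, 0.25), (0.5, 0.5), (0.625, 0.375)]
parity = mean_parity(scores)
```

A value near zero means the model is, on average, indifferent between the two labels; the toy scores above give a parity of 0.25 in favor of “man.”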

Single Distribution.

Figure 2 (top) summarizes the results for models trained on a single distribution. First, we observe that the amount of training examples seems to offer little benefit; mean parity cannot be reduced by simply training on bigger datasets. Second, balancing the data – with or without proxies – helps in mitigating RB. In fact, adding proxies seems particularly beneficial for models with large representations trained on large datasets. Refer to Appendix A.7 for additional figures.

                Data Size = 100M        Data Size = 1B
                balanced   proxies      balanced   proxies
vs. baseline    0.008      0.008        0.039      0.008
vs. balanced    ⋆          0.461        ⋆          0.742
Figure 2: top: Mean parity $\mathbb{E}[p(\mathrm{man})-p(\mathrm{woman})]$ across images from the ILSVRC2012 dataset (Deng et al., 2009). Values closer to zero are better. bottom: On left, parity scores for ViT-B/16 (longer visual sequence length). On right, $p$ values calculated using Wilcoxon’s signed-rank test (Wilcoxon, 1992) for the null hypothesis that column has the same effect as row.
Figure 3: CLIP is trained on 1B examples split into two stages. On the left, it is initially trained on intervened data with proxies, before switching to the original data. On the right, it is trained on the original data before intervening. Legends indicate the fraction of time [%] assigned to Stage 1.
Learning Dynamics.

Figure 3 displays the mean parity but for different durations of Stage 1. To recall, we split training into two stages in order to analyze how quickly CLIP learns and unlearns biases. As shown in Figure 3, the last distribution seen by the model, even if it is a meager 10% of the training duration, heavily impacts its parity. So, fine-tuning on balanced data is an effective remedy.

4.2 Association Bias

                Data Size = 100M            Data Size = 1B
                balanced     proxies        balanced     proxies
vs. baseline    $<10^{-6}$   $<10^{-6}$     $<10^{-6}$   $<10^{-6}$
vs. balanced    ⋆            0.056          ⋆            $<10^{-4}$
Figure 4: top: A comparison of AB (perceived gender against occupation) evaluated in three downstream datasets. bottom: ViT-B/16 results (left) and statistical analysis (right) as in Figure 2.
Figure 5: A summary of how CLIP learns or unlearns association bias (shown on the $y$-axis) when intervened data comprises different percentages [%] of training duration. Setup is similar to Figure 3.

Our second set of evaluations examines the association bias (AB) between perceived gender and occupations. We use FairFace (Karkkainen & Joo, 2021), UTKFace (Zhang et al., 2017) and MIAP (Schumann et al., 2021) datasets, all annotated with perceived gender attributes. The first two datasets contain face images while MIAP contains images more representative of real-world scenarios. We include MIAP because cultural biases extend beyond faces to artifacts and social practices (Berg et al., 2022). To evaluate AB, we calculate the mean absolute parity $(1/|\mathcal{Y}|)\sum_{\mathbf{y}\in\mathcal{Y}}|p(\mathbf{y}|\mathrm{man})-p(\mathbf{y}|\mathrm{woman})|$ across all occupations $\mathcal{Y}$ by providing CLIP with two labels: the occupation’s name vs. the empty string. Then, we average those for all images of perceived men and all images of perceived women. In MIAP, only images containing a single perceived gender are used.
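
The metric above can be sketched as follows, assuming that for each occupation the model's probability of the occupation label (versus the empty string) has already been computed per image. The function name, occupations, and scores are hypothetical.

```python
def mean_absolute_parity(scores_men, scores_women):
    """(1/|Y|) * sum_y |p(y | man) - p(y | woman)|.

    Each argument maps an occupation to a list of per-image probabilities.
    """
    gaps = []
    for occupation in scores_men:
        p_man = sum(scores_men[occupation]) / len(scores_men[occupation])
        p_woman = sum(scores_women[occupation]) / len(scores_women[occupation])
        gaps.append(abs(p_man - p_woman))
    return sum(gaps) / len(gaps)


# Hypothetical probabilities for two occupations and two images per group.
scores_men = {"pilot": [0.75, 0.25], "nurse": [0.25, 0.25]}
scores_women = {"pilot": [0.25, 0.25], "nurse": [0.5, 0.5]}
ab = mean_absolute_parity(scores_men, scores_women)
```

Both occupations show a gap of 0.25 in this toy example (in opposite directions), so the mean absolute parity is 0.25; taking absolute values prevents opposite-signed gaps from cancelling.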

Single Distribution.

Figure 4 (top) summarizes the average parity in models trained on a single data distribution. First, training longer increases the level of bias in baseline, likely because longer training allows the model to reflect biases in the data. But, we again observe that balancing data helps. Nevertheless, adding proxies seems to hurt, likely because the added constraints compete when not all constraints during data balancing can be satisfied simultaneously.

Learning Dynamics.

Figure 5 plots AB vs. the time of intervention. Unlike in RB (see Section 4.1), we now observe a gradual increase or decline in AB depending on how long the model is trained on intervened data, irrespective of whether the balanced data is seen first or last during training.

4.3 Recognizing Sensitive Attributes

The simplest way for a model to remove its bias is to be entirely blind to or unaware of the sensitive attribute. For contrastive models, such an outcome is undesirable since it impacts utility; e.g. the model may not distinguish between “father” and “mother” in the caption. To examine if the improvement in RB and AB due to data balancing impacts the ability of the model to recognize those attributes, we use the zero-shot classification setting with the two labels “man” and “woman” for each image in FairFace, UTKFace and MIAP. When aggregating errors using Wilcoxon’s signed-rank test (Wilcoxon, 1992) with the null hypothesis that (baseline, balanced, proxies) yield the same performance, we have $p>0.05$, indicating the absence of statistically significant differences. See Appendix A.7.

4.4 Model Quality

Table 1: 10-shot classification results using ViT-B/16 as image tower in CLIP, pretrained on 1B examples. Datasets are ILSVRC2012 (Deng et al., 2009), Colorectal (Kather et al., 2016), Cars (Krause et al., 2013), Birds (Welinder et al., 2010), UC (Yang & Newsam, 2010), CIFAR100 (Krizhevsky, 2009), DTD (Cimpoi et al., 2014), Caltech (Fei-Fei et al., 2004) and Pets (Parkhi et al., 2012).

          INet                     Color                    Cars                     Birds                    UC                       C100                     DTD                      Caltech                  Pets
baseline  $50.4^{\pm.1}$           $\mathbf{75.7^{\pm.8}}$  $77.5^{\pm.3}$           $46.8^{\pm.6}$           $91.3^{\pm.1}$           $\mathbf{57.8^{\pm.6}}$  $66.9^{\pm.1}$           $88.9^{\pm.1}$           $71.9^{\pm.1}$
balanced  $\mathbf{50.9^{\pm.1}}$  $74.9^{\pm.6}$           $\mathbf{79.0^{\pm.1}}$  $\mathbf{47.6^{\pm.2}}$  $\mathbf{92.2^{\pm.3}}$  $\mathbf{57.4^{\pm.9}}$  $\mathbf{67.4^{\pm.3}}$  $89.2^{\pm.3}$           $73.3^{\pm.4}$
proxies   $\mathbf{51.1^{\pm.1}}$  $75.0^{\pm.6}$           $\mathbf{78.7^{\pm.2}}$  $\mathbf{47.7^{\pm.1}}$  $91.5^{\pm.3}$           $\mathbf{57.2^{\pm.2}}$  $67.0^{\pm.2}$           $\mathbf{89.6^{\pm.1}}$  $\mathbf{73.4^{\pm.2}}$

Figure 6: top: Zero-shot classification for ILSVRC2012, CIFAR100, and Pets. bottom: Retrieval results (image-to-text and text-to-image) for COCO and Flickr. See Appendix A.7 for full results.

Next, we compare the quality of CLIP models across three downstream evaluations: zero-shot classification, few-shot image classification using the representation provided by the image tower, and retrieval for both COCO (Lin et al., 2015) and Flickr (Young et al., 2014). Here, we report results for ViT-B/16 image tower and defer the full set of figures to Appendix A.7. As shown in Appendix A.7, Table 1 and Figure 6, balancing the data improves classification, on average, but hurts retrieval. In addition, the impact on each metric is statistically significant with $p<10^{-5}$. In Appendix A.6, we conduct an in-depth analysis, which reveals that the impact on quality is attributed to the distribution shift of human and non-human image-text pairs. In particular, debiasing reduces the number of examples containing humans, which improves visual classification since most benchmarks like ImageNet contain few (if any) human images. By contrast, retrieval datasets, such as COCO, contain a significant number ($>40\%$). We analyze and reproduce the effect of debiasing in A.6.

Data Quality & Architectural Improvement. Next, we investigate the impact of data quality and architectural improvements on model performance. Rather than training CLIP on minimally filtered image-text pairs (where filtering primarily addresses personally identifiable information (PII) and inappropriate content, as described in Chen et al. (2022)), we introduce an additional quality filtering step. This step leverages pretrained models to assess the semantic alignment between image and text content. Image-text pairs with low alignment scores are discarded.

In this evaluation, we use the recently-introduced Sigmoid-loss Language Image Pretraining (SigLIP) (Zhai et al., 2023). We train a two-tower SigLIP with ViT-B/16 image encoder and ViT-B text encoder for 10B examples with and without data balancing (with proxies). Because SigLIP does not provide calibrated probability scores, we evaluate association bias (AB) in the model between gender and occupation by drawing one image of a man and one image of a woman uniformly at random, and asking the model to predict which of the two images contains a person with a given occupation (e.g. “nurse”). Parity is then defined to be the difference in probability that the model chooses one particular gender over the other: $|p(\mathrm{man})-p(\mathrm{woman})|$.
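The pairwise protocol above can be approximated by Monte Carlo sampling. The following is a minimal sketch, not the evaluation code used in the paper: the function name `pairwise_parity` and the score arrays are hypothetical, and we assume each image already has a scalar image-text similarity score for the occupation prompt.

```python
import numpy as np

def pairwise_parity(scores_man, scores_woman, num_pairs=10_000, seed=0):
    """Estimate |p(man) - p(woman)| for a single occupation prompt.

    scores_man / scores_woman: hypothetical similarity scores between the
    prompt and each image of a man / woman (assumed inputs).
    """
    rng = np.random.default_rng(seed)
    # Draw one image from each group uniformly at random, num_pairs times.
    m = rng.choice(scores_man, size=num_pairs)
    w = rng.choice(scores_woman, size=num_pairs)
    # The model "chooses" the image with the higher score; ties are split evenly.
    p_man = np.mean(m > w) + 0.5 * np.mean(m == w)
    p_woman = 1.0 - p_man
    return abs(p_man - p_woman)
```

Under this sketch, the quantity reported per model would be the maximum of `pairwise_parity` over all occupation prompts.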

Table 2 provides the maximum observed parity in each model across all occupations. Clearly, data balancing has a positive impact on the model’s bias, in agreement with our earlier results. Interestingly, however, we do not observe a negative impact on the model quality this time. In fact, the quality of the model seems to improve in all metrics, as shown in Table 3. This suggests that data quality and architectural improvements can mitigate the potential negative impact of data balancing, allowing us to reduce biases in the model without impacting its quality.

Table 2: Association bias (AB) between gender and occupation in SigLIP models when they are pretrained with or without data balancing on a high-quality subset of WebLI (Chen et al., 2022).
| Data | FairFace | UTK | MIAP |
|---|---|---|---|
| Without Intervention | $38.8$ | $38.2$ | $28.4$ |
| With Intervention | $\mathbf{29.9}$ | $\mathbf{33.7}$ | $\mathbf{20.5}$ |

Table 3: Classification and retrieval evaluation results [%] in SigLIP ViT-B/16 with and without data balancing. Data balancing seems to improve the model quality in this setting. See Section 4.4.
| Data | ImageNet 10-shot | ImageNet 0-shot | COCO img2txt@5 | COCO txt2img@5 | Flickr img2txt@5 | Flickr txt2img@5 |
|---|---|---|---|---|---|---|
| Without Intervention | $70.7$ | $77.0$ | $86.0$ | $73.4$ | $98.5$ | $94.3$ |
| With Intervention | $\mathbf{71.4}$ | $\mathbf{77.5}$ | $\mathbf{87.1}$ | $\mathbf{73.7}$ | $\mathbf{98.8}$ | $\mathbf{94.9}$ |

5 Multi-Modal Moment Matching (M4)

Denote $\mathcal{S}:|\mathcal{S}|=m$ and $\mathcal{Y}:|\mathcal{Y}|=c$ for the set of attributes and labels respectively. To mitigate representation and association biases (see Section 2), we reweigh examples in the data. It can be shown that finding a set of weights $\mathbf{q}$ assigned to training examples, which mitigate the two types of bias, is equivalent to finding a feasible solution $\mathbf{q}$ to the following constraints (see Appendix A.2.1):

$$\forall_{k\in[m],\,r\in[c]}\left|\mathbb{E}_{\mathcal{D}}\left[\mathbf{q}\cdot\left(\mathbf{s}_{k}-\pi_{k}\right)\cdot\mathbf{y}_{r}\right]\right|\leq\epsilon_{D}\quad\wedge\quad\forall_{k\in[m]}\left|\mathbb{E}_{\mathcal{D}}\left[\mathbf{q}\cdot\left(\mathbf{s}_{k}-\pi_{k}\right)\right]\right|\leq\epsilon_{R}, \tag{3}$$

for some small tolerance levels $\epsilon_{D}$ and $\epsilon_{R}$. The intuition behind this follows from the fact that since both $\mathbf{y}_{r}$ and $\mathbf{s}_{k}$ are binary-valued, zero covariance implies independence. Because finding a solution when $\mathbf{q}\in\{0,1\}$ is NP-hard (Mehrotra & Celis, 2021), we relax the problem by optimizing $\mathbf{q}$ within the unit interval $[0,1]$ and we sub-sample from the data according to it. However, since the same algorithm can potentially be used in-processing, with $\mathbf{q}$ taking any value in $\mathbb{R}^{+}$, we make our treatment general by adding the constraints $0\leq\mathbf{q}\leq Q$ for some $Q\in\mathbb{R}^{+}\cup\{\infty\}$.
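The claim that zero covariance implies independence for binary variables can be checked directly. The short derivation below is a sketch for a single attribute $s$ and label $y$; the probabilities are taken under the reweighted distribution with $P(s=1)=\pi$, which is our reading of the constraints in (3).

$$\mathrm{Cov}(s,y)=\mathbb{E}\left[(s-\pi)\,y\right]=P(s=1,y=1)-\pi\,P(y=1).$$

If this quantity is zero, then $P(s=1,y=1)=P(s=1)\,P(y=1)$. The remaining cells follow by complement, e.g.

$$P(s=0,y=1)=P(y=1)-P(s=1,y=1)=(1-\pi)\,P(y=1)=P(s=0)\,P(y=1),$$

and similarly for $y=0$, so the joint distribution factorizes in all four cells.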

In practice, it can sometimes be useful to constrain the average size of $\mathbf{q}$ as well. For example, when subsampling from a dataset, $\mathbb{E}[\mathbf{q}]$ must be equal to the subsampling rate $\eta\in(0,1]$. In addition, a common assumption in fair subset selection is to allow examples to have different utilities $\mathbf{u}\geq 0$ (Stoyanovich et al., 2018); e.g. based on video engagement or text/image quality. Finally, bias constraints should correspond to a soft penalty to accommodate cases when they are not feasible, which can occur even when $\epsilon_{D},\epsilon_{R}>0$. We incorporate all such features into the algorithm.
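To illustrate why the average weight must match the subsampling rate, the sketch below keeps each example independently with probability equal to its weight. The function name `subsample` and the synthetic weights are hypothetical and only serve as an example; the paper does not specify the sampling procedure at this level of detail.

```python
import numpy as np

def subsample(weights, seed=0):
    """Keep example i independently with probability weights[i]."""
    weights = np.asarray(weights, dtype=float)
    if np.any((weights < 0) | (weights > 1)):
        raise ValueError("weights must lie in [0, 1] to act as probabilities")
    rng = np.random.default_rng(seed)
    # rng.random draws from [0, 1), so weight 1.0 always keeps the example.
    keep = rng.random(weights.shape[0]) < weights
    return np.flatnonzero(keep)

# Hypothetical weights whose average equals a subsampling rate of eta = 0.9.
q = np.concatenate([np.full(8000, 1.0), np.full(2000, 0.5)])
kept = subsample(q)
```

With these weights, the expected fraction of retained examples equals the mean of `q`, i.e. the subsampling rate.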

Hyperparameters:
$\eta\in\mathbb{R}^{++}$ : average weight size
$Q>\eta\in\mathbb{R}^{++}$ : maximum weight
$V\in\mathbb{R}^{++}$ : enforcement level
$\tau\in\mathbb{R}^{++}$ : learning rate
$\epsilon_{D},\epsilon_{R}\in[0,1)$ : bias constraint levels
Update:
1: $\mathbf{a}\;\leftarrow\;\mathrm{biasVector}(\mathbf{s},\,\mathbf{y},\,\pi,\,\epsilon_{D},\,\epsilon_{R})$
2: $\beta\;\leftarrow\;[v^{T}\mathbf{a}+\mu-\eta\mathbf{u}]^{+}$
3: $\alpha\;\leftarrow\;[\mathbf{u}(\eta-Q)-v^{T}\mathbf{a}-\mu]^{+}$
4: $\mathbf{q}\;\leftarrow\;\eta-(v^{T}\mathbf{a}+\mu+\alpha-\beta)/\mathbf{u}$
5: $v\;\leftarrow\;v+\tau\,\frac{\mathbf{q}}{\eta}\,\mathbf{a}$
6: $\mu\;\leftarrow\;\mu+\tau\,\left(\frac{\mathbf{q}}{\eta}-1\right)$
Algorithm 1 Update step per example
  import numpy as np

  def biasVector(s, z, pi, epsd, epsr):
    """Computes the bias vector for a batch of examples.

    Args:
      s: sensitive attributes, of shape (batch_size, m).
      z: labels, of shape (batch_size, c).
      pi: target distribution, of length m.
      epsd & epsr: bias constraint levels.
    Returns:
      b: bias vectors, of shape (batch_size, 2m(c+1)).
    """
    batch_size = s.shape[0]
    # Per-example outer product of centered attributes and labels, flattened.
    dp = np.einsum("bi,bo->bio", s - pi, z).reshape((batch_size, -1))
    b = np.concatenate([
        dp - epsd, -dp - epsd,  # AB
        s - pi - epsr, -s + pi - epsr  # RB
      ], axis=1)
    return b
Algorithm 2 Bias vector implementation.
Figure 7: Pseudo-code of the data balancing algorithm in Section 5. left: Single update per example $(\mathbf{s},\mathbf{y},\mathbf{u})$, where $\mathbf{u}$ is the example’s utility. right: Numpy-like implementation of the bias vector $\mathbf{a}$.
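For concreteness, the six update steps of Algorithm 1 can be transcribed into NumPy as follows. This is a sketch under stated assumptions rather than the reference implementation: the names `bias_vector` and `update` are ours, the bias vector is computed for a single example, and because the listing does not show how the enforcement level $V$ enters the update, any projection of $v$ is omitted here rather than guessed.

```python
import numpy as np

def bias_vector(s, y, pi, eps_d, eps_r):
    """Single-example bias vector of length 2m(c+1), mirroring Algorithm 2."""
    dp = np.outer(s - pi, y).ravel()
    return np.concatenate([dp - eps_d, -dp - eps_d,
                           s - pi - eps_r, -s + pi - eps_r])

def update(v, mu, s, y, u, pi, eta, Q, tau, eps_d, eps_r):
    """One pass through steps 1-6 of Algorithm 1 for one example."""
    a = bias_vector(s, y, pi, eps_d, eps_r)   # step 1
    z = v @ a + mu
    beta = max(z - eta * u, 0.0)              # step 2
    alpha = max(u * (eta - Q) - z, 0.0)       # step 3
    q = eta - (z + alpha - beta) / u          # step 4
    v = v + tau * (q / eta) * a               # step 5
    mu = mu + tau * (q / eta - 1.0)           # step 6
    return v, mu, q

# Hypothetical example with m = 2 attributes and c = 3 labels.
s = np.array([1.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
pi = np.array([0.5, 0.5])
v, mu, q = update(np.zeros(16), 0.0, s, y, u=1.0, pi=pi,
                  eta=0.9, Q=1.0, tau=0.1, eps_d=0.0, eps_r=0.0)
```

Starting from $v=0$ and $\mu=0$, the first example receives weight $\eta$, and only $v$ moves, in the direction of that example's bias vector.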

An overview of the data balancing algorithm is shown in Figure 7. It maintains two optimization variables $v\in\mathbb{R}^{2m(c+1)}$ and $\mu\in\mathbb{R}$, which are used to calculate the sample weight $\mathbf{q}$ by solving:

$$\operatorname*{minimize}_{\mathbb{E}[\mathbf{q}]=\eta\;\wedge\;0\leq\mathbf{q}\leq Q}\quad\left\{\frac{1}{2}\,\mathbb{E}_{\mathcal{D}}\left[\mathbf{u}\cdot\left(\mathbf{q}-\eta\right)^{2}\right]+V\cdot\left(\sum_{k\in[m]}l^{R}_{k}+\sum_{k\in[m],\,r\in[c]}l^{D}_{k,r}\right)\right\}, \tag{4}$$

where $l_{k,r}^{D}=\max\left\{0,\,\left|\mathbb{E}_{\mathcal{D}}\left[\mathbf{q}\cdot\left(\mathbf{s}_{k}-\pi_{k}\right)\cdot\mathbf{y}_{r}\right]\right|-\epsilon_{D}\right\}$ and $l_{k}^{R}=\max\{0,\,\left|\mathbb{E}_{\mathcal{D}}\left[\mathbf{q}\cdot\left(\mathbf{s}_{k}-\pi_{k}\right)\right]\right|-\epsilon_{R}\}$ are the violations to the bias constraints in (3). The first term in (4) encourages the weights to be close to $\eta$, since we have the constraint $\mathbb{E}[\mathbf{q}]=\eta$, while assigning higher priority to examples with greater utility $\mathbf{u}$. The second term penalizes biases, with $V>0$ controlling bias enforcement levels.
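To make the two terms of the objective concrete, the sketch below evaluates it on a finite sample, replacing expectations with sample means. The function names and the toy data are hypothetical; this only evaluates the objective for given weights and is not the solver.

```python
import numpy as np

def bias_violations(q, s, y, pi, eps_d, eps_r):
    """Empirical violations l^R (length m) and l^D (shape m x c)."""
    centered = s - pi                                        # (n, m)
    rep = np.abs(np.mean(q[:, None] * centered, axis=0))     # (m,)
    assoc = np.abs(np.einsum("n,nk,nr->kr", q, centered, y)) / len(q)
    return np.maximum(0.0, rep - eps_r), np.maximum(0.0, assoc - eps_d)

def objective(q, u, s, y, pi, eta, V, eps_d, eps_r):
    """Sample estimate of the quadratic term plus V times total violation."""
    l_r, l_d = bias_violations(q, s, y, pi, eps_d, eps_r)
    return 0.5 * np.mean(u * (q - eta) ** 2) + V * (l_r.sum() + l_d.sum())

# Toy data: one binary attribute, one binary label, four examples.
s = np.array([[1.0], [1.0], [0.0], [0.0]])
pi = np.array([0.5])
q = np.ones(4)
u = np.ones(4)
y_independent = np.array([[1.0], [0.0], [1.0], [0.0]])
y_correlated = np.array([[1.0], [1.0], [0.0], [0.0]])
```

With uniform weights, the independent labels incur no penalty, while the perfectly correlated labels incur an association penalty even though the representation constraint is satisfied.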

Proposition 1.

Algorithm 1 terminates with an optimal solution to the optimization problem in (4).

The proof is in Appendix A.2.1. At inference time, the weight $\mathbf{q}$ assigned to a new example is:

$$\mathbf{q}=\eta-\frac{1}{\mathbf{u}}\left(v^{T}\mathbf{a}+\mu+\left[\mathbf{u}\left(\eta-Q\right)-v^{T}\mathbf{a}-\mu\right]^{+}-\left[v^{T}\mathbf{a}+\mu-\eta\mathbf{u}\right]^{+}\right). \tag{5}$$

Here, $\mathbf{a}$ is a “bias vector” calculated from the sensitive attributes $\mathbf{s}$ and labels $\mathbf{y}$ (see Appendix A.2.1). We provide examples of how the algorithm works and empirical verification in Appendix A.2.3. We also compare Algorithm 1 against other debiasing methods for binary classification in Appendix A.2.4.

Proposition 2.

Starting from the initial values $v=0$ and $\mu=0$, let $F_{t}$ be the dual loss of the optimization problem in (4) after $t$ updates of Algorithm 1 and denote $F_{\infty}$ for its limiting value. Then, Algorithm 1 with the learning rate schedule $\tau_{t}=O(1/\sqrt{t})$ satisfies: $\big|\min_{t}\;\mathbb{E}[F_{t}]-F_{\infty}\big|\leq O\left(\left(\frac{Qmc}{\eta}\right)^{2}\frac{\log t}{\sqrt{t}}\right)$.

The proof is in Appendix A.2.2. Proposition 2 states that Algorithm 1 needs, at most, $O((Qmc/\eta)^{4})$ examples to converge. Since $m=|\mathcal{S}|$ and $c=|\mathcal{Y}|$ are typically small, convergence is fast. Throughout our experiments, we use the weights $\mathbf{q}$ to subsample from the original dataset. The subsampling rate ($\eta$ in Algorithm 1) is chosen to be the maximum rate where bias constraints are still satisfiable, which happens to be 90% in our setup. We use subsampling, instead of reweighting, because subsampling tends to perform more favorably (Sagawa et al., 2020; Celis et al., 2018).
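The fourth power in the sample complexity follows from inverting the bound in Proposition 2. As a rough sketch, ignore the logarithmic factor and fix a target accuracy $\delta>0$ (a symbol we introduce here only for illustration). Then

$$\left(\frac{Qmc}{\eta}\right)^{2}\frac{1}{\sqrt{t}}\leq\delta\quad\Longleftrightarrow\quad t\geq\frac{1}{\delta^{2}}\left(\frac{Qmc}{\eta}\right)^{4},$$

so for a fixed $\delta$ the number of examples scales as $(Qmc/\eta)^{4}$, up to logarithmic factors.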

In Appendix A.2, we provide an empirical evaluation of the algorithm against other preprocessing methods in the literature, when all are applied in the classical supervised binary classification setting.

6 Related Works

Fairness in Multimodal Systems. Fairness is a social construct and an important consideration when evaluating machine learning models. Research has shown that in the absence of bias mitigation, machine learning systems can amplify societal stereotypes (Hendricks et al., 2018; Bolukbasi et al., 2016; Caliskan et al., 2017; Yang et al., 2020), cause performance disparities (Buolamwini & Gebru, 2018; Deuschel et al., 2020) and encode cultural biases and perspectives (Hutchinson et al., 2022; DeVries et al., 2019). For multimodal systems, in particular, while there is a growing interest in their applications and datasets, such as LAION (Schuhmann et al., 2021) and WebLI (Chen et al., 2022), recent findings indicate that the use of multiple modalities not only continues to encode societal biases (Hutchinson et al., 2022; Wang et al., 2023) including in CLIP models (Hall et al., 2023; Wang et al., 2022), but can also amplify them further compared to unimodal systems (Booth et al., 2021). This includes not only text-to-image generative models and CLIP but also image captioning (Zhao et al., 2021; Tang et al., 2021). In particular, multimodal datasets, such as LAION, were found to contain problematic content, such as stereotypes and ethnic slurs (Birhane et al., 2021). A few methods have been proposed for mitigating such biases, including adversarial training (Yan et al., 2020; Berg et al., 2022), projection (Wang et al., 2023) and dropout (Wang et al., 2021). Yet, we are not aware of prior works that examine the effectiveness of data balancing, such as in CLIP (Radford et al., 2021).

Reweighting methods. The data balancing algorithm we develop is a variant of reweighting methods. Because it does not alter examples (unlike fair representation learning such as Zemel et al. (2013); Feldman et al. (2015); Lum & Johndrow (2016); Calmon et al. (2017); Madras et al. (2018)) and does not alter labels (unlike methods such as Kamiran & Calders (2009) and Alabdulmohsin et al. (2022a)), it serves as a viable baseline for debiasing CLIP-style models. However, there has been some skepticism in parts of the literature about the efficacy of reweighting in overparameterized models (Byrd & Lipton, 2019). Intuitively, an overparameterized model has the potential to perfectly fit all training examples, rendering it optimal regardless of the sample weights used. Nonetheless, in the multimodal setting, the size of the training dataset can be extremely large (i.e., billions of examples), and the model is only trained for a few epochs. In this setup, re-weighting the data can be impactful, as we demonstrate in our experiments. In the traditional supervised learning setup, there is evidence that reweighting is competitive (Chen et al., 2018; Idrissi et al., 2022; Choi et al., 2020), and that subsampling performs more favorably (Sagawa et al., 2020; Celis et al., 2018). Analogous techniques have additionally been used in other contexts, including texts (Dixon et al., 2018) and fine-tuning on balanced data to mitigate spurious correlations (Kirichenko et al., 2022).

Advantages of M4. Our data diversity algorithm offers more flexibility than prior methods by relaxing many of their constraints. For instance, it accommodates an arbitrary number of overlapping groups and attributes, making it also capable of handling traditional multiclass settings, for which few black-box debiasing algorithms exist and those that do have only recently been developed (Alghamdi et al., 2022; Alabdulmohsin et al., 2022a). Diversity sampling has a well-established history, sometimes referred to as “fair subset selection” (Drosou et al., 2017; Mehrotra & Celis, 2021). It originates from early observations in search that sub-sampling data by maximizing utility leads to an under-representation of minorities (Kay et al., 2015). However, prior works, such as Stoyanovich et al. (2018), Yan et al. (2020) and Mehrotra & Celis (2021), only address representational biases (i.e., first-order statistics) and fail to account for association biases (i.e., second-order statistics). Other data balancing algorithms such as in Mehrotra & Celis (2021) even solve LPs, making them prohibitive for Internet-scale multimodal data. The data balancing algorithm we introduce resolves such limitations.

7 Conclusion and Future Work

We present a data balancing algorithm and use it to study the impact of mitigating biases in Contrastive Language Image Pretraining (CLIP), a popular training paradigm used in several domains. We define representation and association bias in data and model separately and carefully disentangle the effects of the model, data, and representation size. Our findings suggest that balancing data is impactful but insufficient for obtaining fair downstream behavior so it should be combined with other intervention methods, such as in- and post-processing. In addition, it is recommended that models are trained on balanced data from the outset given that fine-tuning is less effective in removing association biases. Finally, the impact on quality should be assessed for both human-related metrics (e.g. describing actions) and non-human related metrics (e.g. classifying objects) since they can behave differently. Nevertheless, our analysis is necessarily limited since fairness cannot be reduced to statistical metrics. For example, we do not discuss which sensitive attributes are relevant or what acceptable label associations might be, and we only focus on contrastive (not generative) models. Potential future directions include exploring data augmentation methods and validating the effectiveness of our work on datasets such as Casual Conversations (Hazirbas et al., 2021), which have self-identified labels.

References

  • Agarwal et al. (2018) Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In ICML, 2018.
  • Agostinelli et al. (2023) Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Musiclm: Generating music from text, 2023.
  • Akyürek et al. (2022) Afra Feyza Akyürek, Muhammed Yusuf Kocyigit, Sejin Paik, and Derry Wijaya. Challenges in measuring bias via open-ended language generation, 2022.
  • Alabdulmohsin & Lucic (2021) Ibrahim Alabdulmohsin and Mario Lucic. A near optimal algorithm for debiasing trained machine learning models. In NeurIPS, 2021.
  • Alabdulmohsin et al. (2022a) Ibrahim Alabdulmohsin, Jessica Schrouff, and Oluwasanmi Koyejo. A reduction to binary approach for debiasing multiclass datasets. In NeurIPS, 2022a.
  • Alabdulmohsin et al. (2022b) Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. NeurIPS, 2022b.
  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Alghamdi et al. (2022) Wael Alghamdi, Hsiang Hsu, Haewon Jeong, Hao Wang, P Winston Michalak, Shahab Asoodeh, and Flavio P Calmon. Beyond adult and compas: Fairness in multi-class prediction. In NeurIPS, 2022.
  • Berg et al. (2022) Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, and Max Bain. A prompt array keeps the bias away: Debiasing vision-language models with adversarial learning. arXiv preprint arXiv:2203.11933, 2022.
  • Beyer et al. (2022) Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022.
  • Bigham et al. (2006) Jeffrey P Bigham, Ryan S Kaminsky, Richard E Ladner, Oscar M Danielsson, and Gordon L Hempton. Webinsight: making web images accessible. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility, pp.  181–188, 2006.
  • Birhane et al. (2021) Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
  • Blake & Merz (1998) C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
  • Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In NeurIPS, 2016.
  • Booth et al. (2021) Brandon M Booth, Louis Hickman, Shree Krishna Subburaj, Louis Tay, Sang Eun Woo, and Sidney K D’Mello. Bias and fairness in multimodal machine learning: A case study of automated video interviews. In Proceedings of the 2021 International Conference on Multimodal Interaction, pp.  268–277, 2021.
  • Boyd & Mutapcic (2008) Stephen Boyd and Almir Mutapcic. Stochastic subgradient methods. 2008. URL https://see.stanford.edu/materials/lsocoee364b/04-stoch_subgrad_notes.pdf.
  • Boyd & Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Bradbury et al. (2018) James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  • Buolamwini & Gebru (2018) Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In FAccT, 2018.
  • Byrd & Lipton (2019) Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.  872–881. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/byrd19a.html.
  • Caliskan et al. (2017) Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 2017.
  • Calmon et al. (2017) Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/9a49a25d845a483fae4be7e341368e36-Paper.pdf.
  • Celis et al. (2018) Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth Vishnoi. Fair and diverse DPP-based data summarization. In Jennifer Dy and Andreas Krause (eds.), ICML, volume 80 of Proceedings of Machine Learning Research, pp.  716–725. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/celis18a.html.
  • Chang et al. (2023) Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-to-image generation via masked generative transformers, 2023.
  • Chen et al. (2018) Irene Chen, Fredrik D Johansson, and David Sontag. Why is my classifier discriminatory? NeurIPS, 31, 2018.
  • Chen et al. (2022) Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Cho et al. (2022) Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative models, 2022.
  • Choi et al. (2020) Kristy Choi, Aditya Grover, Trisha Singh, Rui Shu, and Stefano Ermon. Fair generative modeling via weak supervision. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  1887–1898. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/choi20a.html.
  • Cimpoi et al. (2014) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Cover & Thomas (1991) Thomas M Cover and Joy A Thomas. Elements of information theory. Wiley & Sons, 1991.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • Deuschel et al. (2020) Jessica Deuschel, Bettina Finzel, and Ines Rieger. Uncovering the bias in facial expressions. arXiv preprint arXiv:2011.11311, 2020.
  • DeVries et al. (2019) Terrance DeVries, Ishan Misra, Changhan Wang, and Laurens van der Maaten. Does object recognition work for everyone?, 2019.
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring and mitigating unintended bias in text classification. In Conference on AI, Ethics, and Society, 2018.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • Drosou et al. (2017) Marina Drosou, Hosagrahar V Jagadish, Evaggelia Pitoura, and Julia Stoyanovich. Diversity in big data: A review. Big data, 5(2):73–84, 2017.
  • Dudik et al. (2020) Miroslav Dudik, Richard Edgar, Brandon Horn, and Roman Lutz. fairlearn 0.4.6. 2020. URL https://pypi.org/project/fairlearn/.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science, 2012.
  • Eva (2019) Benjamin Eva. Principles of indifference. The Journal of Philosophy, 116(7):390–411, 2019.
  • Fabris et al. (2022) Alessandro Fabris, Stefano Messina, Gianmaria Silvello, and Gian Antonio Susto. Algorithmic fairness datasets: the story so far. Data Mining and Knowledge Discovery, 36(6):2074–2152, 2022.
  • Fei-Fei et al. (2004) Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. 2004.
  • Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In SIGKDD, pp.  259–268, 2015.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Hall et al. (2023) Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, and Candace Ross. Vision-language models performing zero-shot tasks exhibit gender-based disparities, 2023.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In NeurIPS, 2016.
  • Hazirbas et al. (2021) Caner Hazirbas, Joanna Bitton, Brian Dolhansky, Jacqueline Pan, Albert Gordo, and Cristian Canton Ferrer. Towards measuring fairness in ai: the casual conversations dataset. IEEE Transactions on Biometrics, Behavior, and Identity Science, 4(3):324–332, 2021.
  • Heek et al. (2020) Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. URL http://github.com/google/flax.
  • Hendricks et al. (2018) Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In ECCV, 2018.
  • Hutchinson et al. (2022) Ben Hutchinson, Jason Baldridge, and Vinodkumar Prabhakaran. Underspecification in scene description-to-depiction tasks, 2022.
  • Idrissi et al. (2022) Badr Youbi Idrissi, Martin Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy, 2022.
  • Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pp.  4904–4916. PMLR, 2021.
  • Kamiran & Calders (2009) Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, 2009.
  • Karkkainen & Joo (2021) Kimmo Karkkainen and Jungseock Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  1548–1558, 2021.
  • Kather et al. (2016) Jakob Nikolas Kather, Cleo-Aron Weis, Francesco Bianconi, Susanne M Melchers, Lothar R Schad, Timo Gaiser, Alexander Marx, and Frank Gerrit Zöllner. Multi-class texture analysis in colorectal cancer histology. Scientific reports, 6:27988, 2016.
  • Kay et al. (2015) Matthew Kay, Cynthia Matuszek, and Sean A Munson. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd annual ACM conference on human factors in computing systems, pp.  3819–3828, 2015.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirichenko et al. (2022) Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations, 2022.
  • Kleinberg et al. (2016) Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
  • Kohavi (1996) Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In SIGKDD, 1996.
  • Krause et al. (2013) Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kudo & Richardson (2018) Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pp.  66–71. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-2012. URL https://doi.org/10.18653/v1/d18-2012.
  • Lin et al. (2015) Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
  • Lum & Johndrow (2016) Kristian Lum and James Johndrow. A statistical framework for fair predictive algorithms. arXiv preprint arXiv:1610.08077, 2016.
  • Madras et al. (2018) David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations, 2018.
  • Makar et al. (2022) Maggie Makar, Ben Packer, Dan Moldovan, Davis Blalock, Yoni Halpern, and Alexander D’Amour. Causally motivated shortcut removal using auxiliary labels. In International Conference on Artificial Intelligence and Statistics, pp.  739–766. PMLR, 2022.
  • Mehrabi et al. (2019) Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
  • Mehrotra & Celis (2021) Anay Mehrotra and L Elisa Celis. Mitigating bias in set selection with noisy protected attributes. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  237–248, 2021.
  • Middleton et al. (2016) Joel A Middleton, Marc A Scott, Ronli Diakow, and Jennifer L Hill. Bias amplification and bias unmasking. Political Analysis, 24(3):307–323, 2016.
  • Minderer et al. (2022a) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230, 2022a.
  • Minderer et al. (2022b) Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In European Conference on Computer Vision, pp.  728–755. Springer, 2022b.
  • Mishkin et al. (2022) Pamela Mishkin, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. Dall·e 2 preview - risks and limitations, 2022.
  • Mitchell et al. (2019) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. pp.  220–229, 2019. doi: 10.1145/3287560.3287596. URL https://doi.org/10.1145/3287560.3287596.
  • Monk (2019) Ellis Monk. The monk skin tone scale, 2019. URL https://skintone.google.
  • Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
  • Parkhi et al. (2012) O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
  • Sagawa et al. (2020) Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. An investigation of why overparameterization exacerbates spurious correlations. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  8346–8356. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/sagawa20a.html.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
  • Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021.
  • Schumann et al. (2021) Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Rebecca Pantofaru. A step toward more inclusive people annotations for fairness. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2021.
  • Sellergren et al. (2022) Andrew B Sellergren, Christina Chen, Zaid Nabulsi, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Charles Lau, Sreenivasa Raju Kalidindi, Mozziyar Etemadi, et al. Simplified transfer learning for chest radiography models using less data. Radiology, 305(2):454–465, 2022.
  • Stoyanovich et al. (2018) Julia Stoyanovich, Ke Yang, and HV Jagadish. Online set selection with fairness and diversity constraints. In Proceedings of the EDBT Conference, 2018.
  • Tan et al. (2020) Shuhan Tan, Yujun Shen, and Bolei Zhou. Improving the fairness of deep generative models without retraining. arXiv preprint arXiv:2012.04842, 2020.
  • Tang et al. (2021) Ruixiang Tang, Mengnan Du, Yuening Li, Zirui Liu, Na Zou, and Xia Hu. Mitigating gender bias in captioning systems. In WWW, pp.  633–645, 2021.
  • Tiu et al. (2022) Ekin Tiu, Ellie Talius, Pujan Patel, Curtis P Langlotz, Andrew Y Ng, and Pranav Rajpurkar. Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering, pp.  1–8, 2022.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Veitch et al. (2021) Victor Veitch, Alexander D’Amour, Steve Yadlowsky, and Jacob Eisenstein. Counterfactual invariance to spurious correlations: Why and how to pass stress tests. May 2021.
  • Wang et al. (2021) Jialu Wang, Yang Liu, and Xin Eric Wang. Are gender-neutral queries really gender-neutral? mitigating gender bias in image search, 2021.
  • Wang et al. (2022) Junyang Wang, Yi Zhang, and Jitao Sang. Fairclip: Social bias elimination based on attribute prototype learning and representation neutralization. arXiv preprint arXiv:2210.14562, 2022.
  • Wang et al. (2019) Tianlu Wang, Jieyu Zhao, Mark Yatskar, Kai-Wei Chang, and Vicente Ordonez. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, 2019.
  • Wang et al. (2023) Zihao Wang, Lin Gui, Jeffrey Negrea, and Victor Veitch. Concept algebra for text-controlled vision models, 2023.
  • Welinder et al. (2010) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • Wilcoxon (1992) Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution, pp.  196–202. Springer, 1992.
  • Yan et al. (2020) Shen Yan, Di Huang, and Mohammad Soleymani. Mitigating biases in multimodal personality assessment. In Proceedings of the 2020 International Conference on Multimodal Interaction, pp.  361–369, 2020.
  • Yang et al. (2020) Forest Yang, Moustapha Cisse, and Sanmi Koyejo. Fairness with overlapping groups. arXiv preprint arXiv:2006.13485, 2020.
  • Yang & Newsam (2010) Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS), 2010.
  • Yeh & Lien (2009) I-Cheng Yeh and Che-hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2009.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • Yu et al. (2022a) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022a.
  • Yu et al. (2022b) Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022b.
  • Yuan et al. (2021) Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  • Zafar et al. (2017) Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017.
  • Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.
  • Zhai et al. (2022a) Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022a.
  • Zhai et al. (2022b) Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, pp.  18123–18133, 2022b.
  • Zhai et al. (2023) Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  11975–11986, October 2023.
  • Zhang et al. (2023) Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, and Hoifung Poon. Large-scale domain-specific pretraining for biomedical vision-language processing, 2023.
  • Zhang et al. (2017) Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • Zhao et al. (2021) Dora Zhao, Angelina Wang, and Olga Russakovsky. Understanding and evaluating racial biases in image captioning. In ICCV, 2021.
  • Zhao et al. (2017) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457, 2017.

Appendix A Appendix

A.1 Sensitive Attribute Annotation

We infer attributes from both modalities, using a vision classifier for images and regular expression matching on captions. The image modality is needed because even highly reputable websites have poor alt-text coverage (Bigham et al., 2006). The text modality is needed because images can be interpreted incorrectly by noisy annotators. For example, Birhane et al. (2021) show how a photograph of a female astronaut could be misinterpreted by an image classifier as a picture of a “housewife in an orange jumpsuit.” Hence, both modalities are used in our analysis.

Why do we infer attributes from both modalities?

As demonstrated in Table 4, the extracted signals from image and text can be incomplete, and in some cases one of the modalities creates an association that is not present in the image or text alone.

Table 4: Dataset examples. The first two columns contain the image (resized to 224px) and text (translated to English), as seen by the model. The third and fourth columns show the profession and gender attributes that are extracted from the image (via an image classifier) and text (via regular expression matching), respectively.

Columns: Image; Text; $\mathbf{s}^{v},\mathbf{y}^{v}$; $\mathbf{s}^{t},\mathbf{y}^{t}$; Comments.

  • Image: [Uncaptioned image] pexels.com. Text: female judge sentencing. Attributes ($\mathbf{s}^{v},\mathbf{y}^{v}$ then $\mathbf{s}^{t},\mathbf{y}^{t}$): judge, perceived woman. Comments: Example of an image where the image signals are incomplete, and the text signals provide additional information that would otherwise be lost.
  • Image: [Uncaptioned image] pixabay.com. Text: archery bow and arrow aim. Attributes ($\mathbf{s}^{v},\mathbf{y}^{v}$ then $\mathbf{s}^{t},\mathbf{y}^{t}$): sport, perceived woman. Comments: The gender is not mentioned in the text description, and the regular expression does not trigger for “sport”, but the image signal extracts both sensitive attribute and profession.
  • Image: [Uncaptioned image] vecteezy.com. Text: a male doctor wearing a green scrub sits on a couch and is falling asleep. Attributes ($\mathbf{s}^{v},\mathbf{y}^{v}$ then $\mathbf{s}^{t},\mathbf{y}^{t}$): perceived woman, doctor, perceived man. Comments: Here the text introduces an additional association between the profession “doctor” and the gender, which was not extracted from the image alone.
  • Image: [Uncaptioned image] pexels.com. Text: photo of man surfing during daytime. Attributes ($\mathbf{s}^{v},\mathbf{y}^{v}$ then $\mathbf{s}^{t},\mathbf{y}^{t}$): sport, perceived man. Comments: In this case, the image signal only points towards a profession, but the text adds the association with “perceived men”.

Why do we concatenate them?

As explained in Section 4, we concatenate annotations from both modalities. For instance, to remove correlation between gender and occupation, we de-correlate all combinations of image- and text-based annotations. Since $\mathbf{s}$ and $\mathbf{y}$ are binary vectors, zero correlations imply independence, so, by the data processing inequality (Cover & Thomas, 1991), this offers a stronger bias guarantee than logical disjunction, where an attribute is marked present if detected in any modality. We illustrate this with a synthetic example next.

Consider the example shown in the table below, where we have 8 examples and the annotations of both $\mathbf{s}$ and $\mathbf{y}$ are provided from both image and text modalities.

| | $\mathbf{s}$ (image) | $\mathbf{y}$ (image) | $\mathbf{s}$ (text) | $\mathbf{y}$ (text) | $\mathbf{s}$ (image or text) | $\mathbf{y}$ (image or text) |
|---|---|---|---|---|---|---|
| Ex 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Ex 2 | 1 | 0 | 1 | 0 | 1 | 0 |
| Ex 3 | 1 | 1 | 0 | 1 | 1 | 1 |
| Ex 4 | 1 | 0 | 1 | 0 | 1 | 0 |
| Ex 5 | 0 | 0 | 0 | 1 | 0 | 1 |
| Ex 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ex 7 | 0 | 0 | 0 | 1 | 0 | 1 |
| Ex 8 | 0 | 0 | 0 | 0 | 0 | 0 |

In this example, if we assume that an attribute is present when it is detected by any modality, we do not observe any bias in the data (rightmost two columns). In particular, the correlation between $\mathbf{s}$ and $\mathbf{y}$ is zero and both attributes are statistically independent of each other. However, the image and text towers each see a different picture, in which $\mathbf{s}$ and $\mathbf{y}$ are correlated with an absolute correlation coefficient $|\rho| > 0.5$. Hence, by ensuring that $\mathbf{s}$ (image) is decorrelated from both $\mathbf{y}$ (image) and $\mathbf{y}$ (text) and vice versa, a stronger bias guarantee is obtained.
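The correlations quoted for the synthetic example can be checked numerically. The following is a minimal sketch that copies the eight rows of the table into NumPy arrays; the helper `corr` and the variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

# Annotations copied from the synthetic table (entries are Ex 1 to Ex 8).
s_img = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_img = np.array([1, 0, 1, 0, 0, 0, 0, 0])
s_txt = np.array([1, 1, 0, 1, 0, 0, 0, 0])
y_txt = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Logical disjunction: an attribute is present if either modality detects it.
s_any = s_img | s_txt
y_any = y_img | y_txt


def corr(a, b):
    """Pearson correlation coefficient between two binary vectors."""
    return float(np.corrcoef(a, b)[0, 1])


rho_any = corr(s_any, y_any)  # disjunction of both modalities
rho_img = corr(s_img, y_img)  # annotations seen by the image tower
rho_txt = corr(s_txt, y_txt)  # annotations seen by the text tower
print(rho_any, rho_img, rho_txt)
```

Under the disjunction the correlation vanishes, while the image-only and text-only annotations give correlations of roughly 0.58 and -0.6, which is consistent with the $|\rho| > 0.5$ statement above.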

A.2 Data Balancing Algorithm

A.2.1 Proof of Proposition 1

First, we prove that finding a sample weight function $\mathbf{q}$ that mitigates both types of bias in Definitions 1 and 2 is equivalent to finding a solution to the set of constraints in (3).

Since $\mathbf{y}_{r}$ and $\mathbf{s}_{k}$ are both binary-valued random variables, having zero demographic parity (DP) is equivalent to zero covariance (Alabdulmohsin & Lucic, 2021), so the DP condition is equivalent to:

$$\mathbb{E}\left[\mathbf{q}\cdot(\mathbf{s}_{k}-\pi_{k})\cdot\mathbf{y}_{r}\right]=0 \tag{6}$$

because:

$$\mathbb{E}\left[\mathbf{q}\cdot(\mathbf{s}_{k}-\pi_{k})\cdot\mathbf{y}_{r}\right]=\mathbb{E}\left[\mathbf{q}\cdot\mathbf{s}_{k}\cdot\mathbf{y}_{r}\right]-\pi_{k}\,\mathbb{E}\left[\mathbf{q}\cdot\mathbf{y}_{r}\right]=\mathbb{E}\left[\mathbf{q}\cdot\mathbf{s}_{k}\cdot\mathbf{y}_{r}\right]-\frac{\mathbb{E}[\mathbf{q}\cdot\mathbf{s}_{k}]}{\mathbb{E}[\mathbf{q}]}\cdot\mathbb{E}\left[\mathbf{q}\cdot\mathbf{y}_{r}\right]$$

By dividing both sides by $\mathbb{E}[\mathbf{q}]$, we have:

$$\frac{\mathbb{E}\left[\mathbf{q}\cdot(\mathbf{s}_{k}-\pi_{k})\cdot\mathbf{y}_{r}\right]}{\mathbb{E}[\mathbf{q}]}=\frac{\mathbb{E}\left[\mathbf{q}\cdot\mathbf{s}_{k}\cdot\mathbf{y}_{r}\right]}{\mathbb{E}[\mathbf{q}]}-\frac{\mathbb{E}[\mathbf{q}\cdot\mathbf{s}_{k}]}{\mathbb{E}[\mathbf{q}]}\cdot\frac{\mathbb{E}\left[\mathbf{q}\cdot\mathbf{y}_{r}\right]}{\mathbb{E}[\mathbf{q}]},$$

which is the covariance w.r.t. the sample weights $\mathbf{q}$. Hence, if condition (6) is satisfied, the association bias is zero.

Second, satisfying the representation bias (RB) constraints is equivalent to setting $\pi_{k}=\mathbb{E}[\mathbf{q}\cdot\mathbf{s}_{k}]/\mathbb{E}[\mathbf{q}]$ by definition, which is equivalent to the second condition in (3).
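As a sanity check of the identity used above, the sketch below draws random binary attributes and arbitrary positive weights (all synthetic, chosen only for illustration) and compares the normalized left-hand side of (6) with the covariance computed under the normalized weights, taking $\pi_{k}$ to be the weighted mean of $\mathbf{s}_{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
s = rng.integers(0, 2, size=n).astype(float)  # binary sensitive attribute s_k
y = rng.integers(0, 2, size=n).astype(float)  # binary attribute y_r
q = rng.uniform(0.1, 1.0, size=n)             # arbitrary positive sample weights

# Representation constraint: pi_k is the q-weighted mean of s_k.
pi = np.mean(q * s) / np.mean(q)

# Left-hand side of (6), divided by E[q].
lhs = np.mean(q * (s - pi) * y) / np.mean(q)

# Covariance of s_k and y_r under the normalized weights q / sum(q).
w = q / q.sum()
weighted_cov = np.sum(w * s * y) - np.sum(w * s) * np.sum(w * y)

print(lhs, weighted_cov)
```

The two quantities agree up to floating-point error for any choice of positive weights, so driving (6) to zero is the same as removing the weighted covariance.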

To solve the optimization problem (4), we solve the dual problem. To simplify notation, suppose initially that the dataset is finite and let $q\in\mathbb{R}^{n}$ be a vector of length $n$ (corresponding to all $n$ examples). The optimization problem is:

$$\begin{aligned}
\min_{q\in\mathbb{R}^{n},\,b\in\mathbb{R}^{2m(c+1)}}\quad & \frac{1}{2}(q-\eta\mathbf{1})^{T}U(q-\eta\mathbf{1})+V\cdot b\\
\text{s.t.}\quad & Aq\leq b\\
& \mathbf{1}^{T}q=\eta n\\
& q\leq Q\\
& q,b\geq 0,
\end{aligned}$$

where $U$ is a diagonal matrix containing the utilities of all examples and $A$ is a matrix formed from the constraints in (3) (a NumPy-based implementation of the columns of $A$ is shown in Figure 7 (right)). We denote the $i$-th column of $A$ by $a^{(i)}$ and refer to it as the “bias vector”.

The Lagrangian is:

$$L(q,b,v,\mu,\alpha,\beta,\delta)=\frac{1}{2}(q-\eta\mathbf{1})^{T}U(q-\eta\mathbf{1})+V\cdot b+v^{T}(Aq-b)+\mu(\mathbf{1}^{T}q-\eta n)-\beta^{T}q-\delta^{T}b+\alpha^{T}(q-Q\mathbf{1})$$

subject to $v,\beta,\delta,\alpha\geq 0$.

Taking the gradient w.r.t. $q$ and setting it to zero:

$$q=\eta\mathbf{1}-U^{-1}\left(A^{T}v+\mu\mathbf{1}-\beta+\alpha\right). \tag{7}$$
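The stationarity condition (7) can be verified numerically. The sketch below uses toy dimensions and randomly drawn multipliers (all values are illustrative assumptions, not settings from the paper), evaluates the gradient of the Lagrangian with respect to $q$, and confirms that it vanishes at the point given by (7).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                           # n examples, k bias constraints (toy sizes)
eta = 0.5
u = rng.uniform(0.5, 2.0, size=n)     # diagonal of U (positive utilities)
A = rng.normal(size=(k, n))           # column i plays the role of a^(i)
v = rng.uniform(0.0, 1.0, size=k)     # multipliers below are arbitrary
mu = 0.3
alpha = rng.uniform(0.0, 1.0, size=n)
beta = rng.uniform(0.0, 1.0, size=n)


def grad_q(q):
    """Gradient of the Lagrangian with respect to q."""
    return u * (q - eta) + A.T @ v + mu - beta + alpha


# Candidate stationary point from (7); U is diagonal, so U^{-1} is a division.
q_star = eta - (A.T @ v + mu - beta + alpha) / u
max_abs_grad = float(np.max(np.abs(grad_q(q_star))))
print(max_abs_grad)
```

Because $U$ is diagonal, the inverse in (7) reduces to an elementwise division, which is why each example's weight can be handled independently.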

Setting the gradient w.r.t. $b$ to zero gives us:

$$0\leq v\leq V\mathbf{1}.$$

Plugging the expression for $q$ back into $L$, we have the minimization problem:

$$\min_{0\leq v\leq V\mathbf{1},\,\mu}\;\sum_{i=1}^{n}\;\inf_{\beta,\alpha\geq 0}\left\{\frac{1}{2u_{i}}\left(v^{T}a^{(i)}+\mu+\alpha-\beta\right)^{2}-\eta v^{T}a^{(i)}-\eta(\alpha-\beta)+Q\alpha\right\}. \tag{8}$$
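To see where (8) comes from, one can evaluate the Lagrangian at the point (7) and compare it with the sum in (8). The following sketch does this on random toy data. It assumes that $V\cdot b$ denotes $V\,\mathbf{1}^{T}b$ with a scalar $V$ and that $\delta=V\mathbf{1}-v$ (the stationarity condition in $b$); both are readings of the notation rather than statements from the paper, and all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
eta, Q, V = 0.5, 1.0, 2.0
u = rng.uniform(0.5, 2.0, size=n)     # diagonal of U
A = rng.normal(size=(k, n))           # column i plays the role of a^(i)
v = rng.uniform(0.0, V, size=k)       # satisfies 0 <= v <= V
delta = V - v                         # assumed: stationarity in b
mu = -0.2
alpha = rng.uniform(0.0, 1.0, size=n)
beta = rng.uniform(0.0, 1.0, size=n)
b = rng.uniform(0.0, 1.0, size=k)     # any b works: its terms cancel


def lagrangian(q):
    """Lagrangian of the primal problem, with V.b read as V * sum(b)."""
    return (0.5 * np.sum(u * (q - eta) ** 2) + V * b.sum()
            + v @ (A @ q - b) + mu * (q.sum() - eta * n)
            - beta @ q - delta @ b + alpha @ (q - Q))


z = A.T @ v + mu + alpha - beta       # per-example term inside the square in (8)
q_star = eta - z / u                  # stationary point from (7)
summand = z ** 2 / (2 * u) - eta * (A.T @ v) - eta * (alpha - beta) + Q * alpha

dual_value = lagrangian(q_star)
print(dual_value, -summand.sum())
```

The Lagrangian at the stationary point equals the negative of the sum, so maximizing the dual function is the same as minimizing the expression in (8).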

We will use the following fact:

Lemma 1.

$\omega^{\star}=[\lambda t+\xi]^{+}$ is a minimizer of the objective function:

$$\min_{\omega\geq 0}\left\{\frac{1}{2\lambda}(\xi-\omega)^{2}-t\omega\right\}.$$
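As a quick sanity check of Lemma 1, the closed form can be compared against a brute-force search over a grid. This is only an illustrative sketch: it assumes $\lambda>0$, and the function names and test values are hypothetical, not taken from the paper.

```python
import numpy as np

def lemma1_minimizer(lam, xi, t):
    """Closed form from Lemma 1: omega* = [lam * t + xi]^+ (assumes lam > 0)."""
    return max(lam * t + xi, 0.0)

def lemma1_objective(omega, lam, xi, t):
    """Objective of Lemma 1: (xi - omega)^2 / (2 * lam) - t * omega."""
    return (xi - omega) ** 2 / (2.0 * lam) - t * omega

# Brute-force minimization over omega >= 0 on a fine grid (step 1e-4).
grid = np.linspace(0.0, 10.0, 100_001)
cases = [(2.0, 1.0, 0.5), (0.5, -3.0, 1.0), (1.0, 4.0, -1.5)]
gaps = []
for lam, xi, t in cases:
    brute = grid[np.argmin(lemma1_objective(grid, lam, xi, t))]
    gaps.append(abs(brute - lemma1_minimizer(lam, xi, t)))
```

In each case the grid minimizer agrees with the closed form up to the grid resolution, including the second case, where the unconstrained minimizer is negative and the solution is clipped to zero.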

Returning to (8), we have by complementary slackness (Boyd & Vandenberghe, 2004) that either $\alpha=0$ or $\beta=0$, so we consider each case separately.

Case I:

If $\alpha=0$, the minimizer $\beta^{(i)}$ of each of the inner optimization problems is, by Lemma 1, $\beta^{(i)}=[w^{(i)}-\eta u^{(i)}]^{+}$, where $w^{(i)}=v^{T}a^{(i)}+\mu$. By plugging this into (8) and using (7), we have:

$$q^{(i)}=\eta-\frac{w^{(i)}-\beta^{(i)}}{u^{(i)}}, \tag{9}$$

and the stochastic gradients:

$$\nabla_{v}=-\frac{q^{(i)}}{\eta}\,a^{(i)},\quad\quad\nabla_{\mu}=1-\frac{q^{(i)}}{\eta}. \tag{10}$$

Case II:

If $\beta=0$, the minimizer $\alpha^{(i)}$ of each of the inner optimization problems is, by Lemma 1, $\alpha^{(i)}=[u^{(i)}\left(\eta-Q\right)-w^{(i)}]^{+}$. Note that we indeed have $\alpha^{(i)}\cdot\beta^{(i)}=0$, as expected. Similar to before, we have:

$$q^{(i)}=\eta-\frac{w^{(i)}+\alpha^{(i)}}{u^{(i)}}, \tag{11}$$

and the same stochastic gradients as in (10). From this, Algorithm 1 follows. By strong duality (Boyd & Vandenberghe, 2004), the solution returned by Algorithm 1 is the optimal solution to the minimization problem in (4).
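The two cases can be summarized in code. Substituting the minimizers into (9) and (11) gives $q^{(i)}=0$ whenever $\beta^{(i)}>0$ and $q^{(i)}=Q$ whenever $\alpha^{(i)}>0$, so $q^{(i)}$ is $\eta-w^{(i)}/u^{(i)}$ clipped to $[0,Q]$. The sketch below is not a transcription of Algorithm 1; it illustrates one projected stochastic gradient step implied by (9)-(11) and the constraint $0\leq v\leq V\mathbf{1}$, assuming $u^{(i)}>0$. The function names and the learning rate argument `lr` are hypothetical.

```python
import numpy as np

def primal_weight(v, mu, a, u, eta, Q):
    """q^(i) from (9) and (11): eta - w/u clipped to [0, Q], with w = v^T a + mu."""
    w = float(v @ a) + mu
    return float(np.clip(eta - w / u, 0.0, Q))

def dual_sgd_step(v, mu, a, u, eta, Q, V, lr):
    """One projected SGD step on the dual variables (v, mu) using the gradients in (10)."""
    q = primal_weight(v, mu, a, u, eta, Q)
    grad_v = -(q / eta) * a
    grad_mu = 1.0 - q / eta
    v_new = np.clip(v - lr * grad_v, 0.0, V)  # projection onto 0 <= v <= V
    mu_new = mu - lr * grad_mu                # mu is unconstrained
    return v_new, mu_new, q

# Toy usage with made-up numbers.
a = np.array([1.0, 0.0, 1.0])
v1, mu1, q1 = dual_sgd_step(np.zeros(3), 0.0, a, u=1.0, eta=0.5, Q=1.0, V=2.0, lr=0.1)
```

With $v=0$ and $\mu=0$ the example starts at $q^{(i)}=\eta$, so the gradient w.r.t. $\mu$ vanishes and only $v$ moves.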

A.2.2 Proof of Proposition 2

As shown in Appendix A.2.1, Algorithm 1 minimizes the loss in (8) via stochastic gradient descent. Writing the optimization variable as $w=(v,\mu)$, the stochastic gradient $g$ in (10) is bounded in norm by:

$$\|g\|_{2}\leq\sqrt{\frac{Q^{2}}{\eta^{2}}\|\mathbf{a}\|^{2}+\left(1-\frac{Q}{\eta}\right)^{2}}\leq\sqrt{\frac{Q^{2}}{\eta^{2}}\left(1+\|\mathbf{a}\|^{2}\right)} \tag{12}$$

because $0\leq\mathbf{q}\leq Q$ in all iterations (see (5)), and the second inequality follows from the fact that $Q\geq\eta$ (the maximum must be at least as large as the mean). The bias configuration $\mathbf{a}$ satisfies $\|\mathbf{a}\|_{\infty}=1+\epsilon$, where $\epsilon=\max\{\epsilon_{D},\,\epsilon_{R}\}$. Hence, $\|\mathbf{a}\|^{2}\leq\left(2(1+\epsilon)m(c+1)\right)^{2}$ because $\mathbf{a}\in\mathbb{R}^{2m(c+1)}$. Since $\epsilon\leq 1$ (otherwise the constraints are void) while $m,c\geq 1$, we have the simplified bound:

$$\|g\|_{2}\leq\frac{5Qm(c+1)}{\eta}. \tag{13}$$

Denote the optimal solutions $v^{\star}$ and $\mu^{\star}$ and write: $R=\|v^{\star}\|^{2}+(\mu^{\star})^{2}$. Using the learning rate schedule $\tau_{t}=1/\sqrt{t}$ and the well-known convergence rates (Boyd & Mutapcic, 2008), we have:

$$\min_{t}\;\mathbb{E}[F_{t}]-F^{\star}\leq\frac{R^{2}+\left(\frac{5Qm(c+1)}{\eta}\right)^{2}\sum_{t}\tau_{t}^{2}}{2\sum_{t}\tau_{t}}=O\left(\left(\frac{Qmc}{\eta}\right)^{2}\frac{\log t}{\sqrt{t}}\right),$$

where $F_{t}$ is the loss function at iteration $t$.
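To get a feel for the rate, the right-hand side of the bound can be evaluated numerically for the schedule $\tau_{t}=1/\sqrt{t}$. The sketch below is illustrative only: the values of $Q$, $\eta$, $m$, $c$ and of the $R^{2}$ term are made up, and `r_sq` simply stands for whatever constant appears as $R^{2}$ in the numerator.

```python
import numpy as np

def gradient_bound(Q, eta, m, c):
    """Simplified gradient-norm bound (13): 5 * Q * m * (c + 1) / eta."""
    return 5.0 * Q * m * (c + 1) / eta

def sgd_bound(r_sq, G, T):
    """(R^2 + G^2 * sum_t tau_t^2) / (2 * sum_t tau_t) with tau_t = 1 / sqrt(t), t = 1..T."""
    t = np.arange(1, T + 1, dtype=float)
    tau = 1.0 / np.sqrt(t)
    return float((r_sq + G**2 * np.sum(tau**2)) / (2.0 * np.sum(tau)))

G = gradient_bound(Q=1.0, eta=0.5, m=1, c=1)
bounds = [sgd_bound(1.0, G, T) for T in (10**2, 10**4, 10**6)]
```

Since $\sum_{t}\tau_{t}^{2}$ grows like $\log t$ while $\sum_{t}\tau_{t}$ grows like $\sqrt{t}$, the computed values shrink as the number of iterations grows, in line with the stated $O(\log t/\sqrt{t})$ rate.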

A.2.3 Empirical Verification

To verify the algorithm, we run it with the following constraints: (1) balance the perceived gender attribute, (2) balance the age group attribute, (3) remove correlations between perceived gender and occupation, and (4) remove correlations between perceived gender and age. The goal is to satisfy all constraints simultaneously. In this illustrative example, however, we consider an attribute present if it is detected in any modality. Within each sensitive attribute axis, we set the target distribution to be the median distribution across all categories (i.e. 12% for both males and females and 1% for all age groups). Figure 8 displays a summary of the results. As shown in the figure, the algorithm mitigates both types of bias simultaneously.

Absolute Pearson Correlations

| | perceived gender × occupation (median) | perceived gender × occupation (max) | perceived gender × age group (median) | perceived gender × age group (max) |
|---|---|---|---|---|
| before | 0.010 | 0.259 | 0.102 | 0.218 |
| after | 0.001 | 0.003 | 0.002 | 0.057 |

Figure 8: Top: a comparison of the distribution of subgroups before and after applying the data diversity algorithm in Figure 7, in which we have four sets of constraints (see Section A.2.3). The target distribution is 12% for both perceived gender categories and 1% for all age groups. Bottom: Pearson correlation scores (max and median) between perceived gender and occupation/age. Algorithm 1 mitigates both first-order and second-order biases simultaneously.
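The median and max statistics reported for Figure 8 can be computed along the following lines. This is a minimal sketch under assumptions: attributes are encoded as binary indicator columns, each column is non-constant (otherwise the correlation is undefined), and the helper name `abs_pearson_summary` is hypothetical.

```python
import numpy as np

def abs_pearson_summary(group, attrs):
    """Median and max of |Pearson correlation| between one binary indicator and each column.

    group: shape (n,), e.g. a perceived-gender indicator.
    attrs: shape (n, k), e.g. one indicator column per occupation.
    """
    group = np.asarray(group, dtype=float)
    attrs = np.asarray(attrs, dtype=float)
    corrs = np.array(
        [abs(np.corrcoef(group, attrs[:, j])[0, 1]) for j in range(attrs.shape[1])]
    )
    return float(np.median(corrs)), float(np.max(corrs))

# Toy data: the first column matches the group, the second is uncorrelated with it.
group = [1, 1, 0, 0]
attrs = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0]]
median_corr, max_corr = abs_pearson_summary(group, attrs)
```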

A.2.4 Comparison against pre-processing algorithms

We compare, for reference, the data balancing algorithm against prior preprocessing methods in the classical binary classification setting.

Specifically, we evaluate the algorithm on the Adult income dataset (Kohavi, 1996) and the Default of Credit Card Clients (DCCC) dataset (Yeh & Lien, 2009). Both come from the UCI ML Repository (Blake & Merz, 1998) and are among the most widely used benchmark datasets in the ML fairness literature (Fabris et al., 2022). The Adult dataset contains 48,842 records with 14 attributes, and the goal is to predict a binary summary of income level. The DCCC dataset contains 30,000 records with 24 attributes, where the goal is to predict if a client will default on credit card payment. In both datasets, binary sex is used as a sensitive attribute.

Besides preprocessing methods, we also include in-processing and post-processing in our comparison. However, the intent here is not to compete, since our goal is to balance the data prior to training.

We compare against the correlation remover method (Dudik et al., 2020), reduce-to-binary (Alabdulmohsin et al., 2022a), reduction to cost-sensitive rules (Agarwal et al., 2018), threshold optimizer (Hardt et al., 2016), and randomized threshold optimizer (Alabdulmohsin & Lucic, 2021), using the implementations released by the authors or in the FairLearn package (Dudik et al., 2020). All hyperparameters are kept at their default settings. The base classifier is a two-layer MLP with 128 hidden units and ReLU activations (Nair & Hinton, 2010) that optimizes the log-loss using Adam (Kingma & Ba, 2014) with an initial learning rate of 0.001.

For evaluation metrics, we report demographic parity (DP), classification error rate, and the balanced error rate (average error rate across both subgroups).
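The three metrics can be sketched as follows for a binary sensitive attribute. The balanced error follows the definition given above (average of the per-subgroup error rates). The demographic parity gap is assumed here to be the absolute difference in positive-prediction rates between the two subgroups, which is a common convention but is not spelled out in the text. The function name is hypothetical.

```python
import numpy as np

def fairness_metrics(y_true, y_pred, s):
    """DP gap, overall error, and balanced error for binary labels and a binary attribute s."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    # Positive-prediction rate and error rate within each subgroup.
    rates = [y_pred[s == g].mean() for g in (0, 1)]
    errors = [(y_pred[s == g] != y_true[s == g]).mean() for g in (0, 1)]
    return {
        "dp": float(abs(rates[0] - rates[1])),
        "error": float((y_pred != y_true).mean()),
        "balanced_error": float(0.5 * (errors[0] + errors[1])),
    }

# Toy example: the classifier predicts 1 for every member of group 0 and 0 for group 1.
metrics = fairness_metrics(
    y_true=[1, 0, 1, 0, 1, 0],
    y_pred=[1, 1, 1, 0, 0, 0],
    s=[0, 0, 0, 1, 1, 1],
)
```

In the toy example the DP gap is maximal even though both subgroups have the same error rate, which is why the two kinds of metrics are reported side by side.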

As shown in Table 5, while data balancing as a mitigation option does not outperform in- and post-processing methods, the data balancing algorithm performs on par with other pre-processing techniques in mitigating bias without impacting the model’s quality. Importantly, our algorithm can handle a much broader setting than prior methods, including an arbitrary number of overlapping attributes and labels.

Table 5: Comparison of the data balancing algorithm against prior bias mitigation methods in the binary classification setting: correlation remover (CR), reduce-to-binary (R2B), reduction to cost-sensitive rules, threshold-optimizer (TO), randomized threshold optimizer (RTO). See Section A.2.4 for details. Last row is for the proposed data balancing algorithm.

| Stage | Method | Adult: DP | Adult: Error | Adult: Balanced Error | DCCC: DP | DCCC: Error | DCCC: Balanced Error |
|---|---|---|---|---|---|---|---|
| | Baseline | 18.6 ± 0.2 | 14.5 ± 0.1 | 12.7 ± 0.1 | 2.1 ± 0.3 | 17.5 ± 0.0 | 17.6 ± 0.0 |
| in- | Reduction | 1.0 ± 0.0 | 16.4 ± 0.1 | 14.9 ± 0.1 | 1.4 ± 0.1 | 17.9 ± 0.2 | 18.2 ± 0.1 |
| post- | TO | 0.9 ± 0.3 | 19.9 ± 0.0 | 20.8 ± 0.1 | 0.7 ± 0.3 | 18.3 ± 0.1 | 18.8 ± 0.1 |
| post- | RTO | 0.8 ± 0.2 | 16.9 ± 0.1 | 16.5 ± 0.0 | 3.3 ± 0.2 | 20.7 ± 0.1 | 20.7 ± 0.2 |
| pre- | CR | 18.9 ± 0.1 | 14.7 ± 0.2 | 12.9 ± 0.2 | 0.8 ± 0.2 | 19.6 ± 0.3 | 20.0 ± 0.2 |
| pre- | R2B | 8.3 ± 0.1 | 15.4 ± 0.0 | 13.4 ± 0.1 | 0.9 ± 0.1 | 28.8 ± 0.2 | 29.1 ± 0.2 |
| pre- | Balancing | 9.1 ± 0.2 | 15.6 ± 0.2 | 13.7 ± 0.2 | 1.2 ± 0.1 | 17.9 ± 0.3 | 18.1 ± 0.2 |

A.3 Model Card

Model details following Mitchell et al. (2019).

  • Model Architecture: The model contains two towers, which are encoder-only vision transformers (ViT) (Dosovitskiy et al., 2020) for image and text modalities. We use architecture sizes ViT-S and ViT-B. For the image tower, we use a patch size of either $32\times 32$ or $16\times 16$. For the text tower, we use the SentencePiece tokenizer (Kudo & Richardson, 2018). The two towers are trained to maximize the similarity of image-text pairs via the contrastive loss.

  • Inputs: The model takes both an image and a text as input.

  • Outputs: The model outputs representations for image and text inputs.

  • Intended Use: The primary use is research on multimodal applications, such as zero-shot classification and retrieval. We use such models to study the effectiveness of data balancing as a bias mitigation strategy.

  • Known Caveats: As noted in several prior works, multimodal systems can pick up societal biases. While we demonstrate some of those issues in this work, our analysis is necessarily limited since fairness is a societal concept that cannot be reduced to simple statistical metrics.

  • System Description: The model is analyzed in a stand-alone setting, and not used as part of a larger system.

  • Upstream Dependencies: None.

  • Downstream Dependencies: None.

  • Hardware & Software: The model is implemented using JAX (Bradbury et al., 2018), Flax (Heek et al., 2020), and Big Vision (Beyer et al., 2022). It is trained on TPU v2.

  • Compute Requirements: Each model is trained on $8\times 8$ TPU v2 chips on 1B seen image-text pairs. A typical training run takes 1.3 days.

  • Model Initialization: The model is trained from a random initialization.

  • Model Size: ViT-S models have approximately 30M parameters per modality. ViT-B models have approximately 100M parameters per modality.

  • Training Dataset: We use an English-only subset of WebLI (Chen et al., 2022), which comprises images with alt-texts from the public web. See Chen et al. (2022) for an overview of the data collection process.

  • Evaluation Datasets: We evaluate the models on ImageNet-ILSVRC2012 (Deng et al., 2009), FairFace (Karkkainen & Joo, 2021), UTKFace (Zhang et al., 2017), and MIAP (Schumann et al., 2021).

A.4 Training Configuration

# input
batch_size = 16_384
shuffle_buffer_size = 250_000
pp = 'decode|resize(224)|value_range(-1,1)'
# model
model.temperature_init = 10.0
# optimizer
optax_name = 'scale_by_adafactor'
grad_clip_norm = 1.0
lr = 0.001
wd = 0.0001
# learning rate schedule
schedule.decay_type = 'rsqrt'
schedule.timescale = 5_000
schedule.warmup_steps = 5_000
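
For intuition about the schedule settings, the sketch below shows one common form of a linear warmup followed by reciprocal square-root decay, using the values of `lr`, `warmup_steps`, and `timescale` from the configuration. The exact formula used by the training library may differ; the function `lr_at` is a hypothetical helper, not part of the configuration.

```python
def lr_at(step, base_lr=0.001, warmup_steps=5_000, timescale=5_000):
    """Assumed schedule: linear warmup to base_lr, then base_lr / sqrt(1 + elapsed / timescale)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    elapsed = step - warmup_steps
    return base_lr / (1.0 + elapsed / timescale) ** 0.5

# Learning rate at a few checkpoints of training.
samples = {step: lr_at(step) for step in (0, 2_500, 5_000, 20_000)}
```

Under this assumed form, the learning rate peaks at 0.001 when warmup ends and has halved again by step 20,000.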

A.5 Proxies

A.5.1 List

We queried a large language model (LLM) to provide a suggested list of proxies that might link perceived gender with occupation. The LLM recommended the list shown in Table 6.

bench, briefcase, book, board, tools, desk, computer, suit, gun, handcuff, tray, ashtray, food, coat, goggles, whistle, clipboard, watch, flour, gym, sport, scrubs, microscope, credit card, shopping, sewing, machine, cabinet, calculator, paint, backpack, headset, microphone, camera, helmet, vacuum, glassware, car, comb, scissor, toothbrush, phone

Table 6: Full list of proxies used in the experiments.

The LLM justified each example by linking it to a particular occupation. For example, “bench” might provide a link to the occupation “judge,” “briefcase” might provide a link to the occupation “manager,” “book” might provide a link to the occupation “author,” and so on.

A.5.2 Causal Analysis of Data Balancing Approaches

In this section, we analyze the properties of the balanced and proxies datasets from a causal perspective, using the formalism presented in Veitch et al. (2021), and make a more formal argument motivating the use of proxies. Specifically, we highlight a case where balancing labels $\mathbf{y}$ on sensitive attributes $\mathbf{s}$ alone is not enough because there are unbalanced proxies $\mathbf{w}$.

Causal Model

We assume that the generative process for images is anti-causal with respect to the label $\mathbf{y}$, sensitive attribute $\mathbf{s}$, and proxy $\mathbf{w}$ variables in our setting. In other words, we assume that these variables are underlying causes of the image/text pair $\mathbf{x}$ that is taken as input.

Figure 9: DAGs describing causal models for image/text pairs $\mathbf{x}$ as a function of sensitive attributes $\mathbf{s}$, labels $\mathbf{y}$, and proxies $\mathbf{w}$. Solid arrows represent causal relationships; bidirectional dashed arrows represent non-causal relationships introduced by confounding or selection. (a) is a general generative model without proxies; (b) is a simplified generative model with proxies.

In this setting, Veitch et al. (2021) show that the input $\mathbf{x}$ can then be decomposed into three pieces, illustrated in Figure 9(a):

  • $\mathbf{x}_{\mathbf{s}^{\perp}}$, or parts of $\mathbf{x}$ that are not causal descendants of the sensitive attributes $\mathbf{s}$. Importantly, $\mathbf{x}_{\mathbf{s}^{\perp}}$ includes portions of $\mathbf{x}$ that are causal descendants of $\mathbf{y}$, but not of $\mathbf{s}$.

  • $\mathbf{x}_{\mathbf{y}^{\perp}}$, or parts of $\mathbf{x}$ that are not causal descendants of $\mathbf{y}$.

  • $\mathbf{x}_{\mathbf{y}\wedge\mathbf{s}}$, or parts of $\mathbf{x}$ that are causal descendants of both $\mathbf{y}$ and $\mathbf{s}$.

The features $\mathbf{x}_{\mathbf{s}^{\perp}}$ are often considered “core” features, since they are not causally influenced by the sensitive attributes $\mathbf{s}$. Veitch et al. (2021) show that predictors that are only a function of $\mathbf{x}_{\mathbf{s}^{\perp}}$ can exhibit desirable counterfactual invariance properties. Correspondingly, the features $\mathbf{x}_{\mathbf{y}^{\perp}}$ are often considered “spurious” features, since they have no causal relationship to the labels $\mathbf{y}$. Finally, the intersection information $\mathbf{x}_{\mathbf{y}\wedge\mathbf{s}}$ entangles information about both $\mathbf{y}$ and $\mathbf{s}$; as we show below, this can introduce complications for data balancing mitigations.

Cases Where Balancing Proxies is Needed

We now provide a formal example highlighting the potential role of proxies in the data balancing algorithm introduced in the paper. For simplicity, in this argument we will assume that there is no intersection information $\mathbf{x}_{\mathbf{y}\wedge\mathbf{s}}$, because it is sufficient to show how proxy variables can complicate data balancing. Veitch et al. (2021) call this the “purely spurious” case.

As shown in Makar et al. (2022), in this setting, data balancing can be sufficient to achieve desirable predictive invariances, if there are no uncontrolled confounders. However, if there are proxy variables $\mathbf{w}$ that are (1) correlated with the sensitive attributes $\mathbf{s}$, and (2) correlated (causally or not) with the “core” features $\mathbf{x}_{\mathbf{s}^{\perp}}$, these can operate as confounders that introduce correlations between sensitive attributes $\mathbf{s}$ and optimal predictors $f(\mathbf{x})$ designed to predict labels $\mathbf{y}$ from inputs $\mathbf{x}$. Importantly, this can happen even when labels $\mathbf{y}$ and sensitive attributes $\mathbf{s}$ are marginally independent, as we might achieve with data balancing.

To see how this can occur, consider the DAG in Figure 9(b) where we introduce proxies $\mathbf{w}$ with a bidirectional arrow connecting them to $\mathbf{s}$. Here, we assume that there is no bidirectional edge connecting $\mathbf{y}$ to $\mathbf{s}$, or connecting $\mathbf{y}$ to $\mathbf{w}$, due to data balancing. (Footnote 2: In practice, after data balancing, there can be an edge connecting $\mathbf{y}$ to $\mathbf{w}$, which would imply that there is an open path between $\mathbf{y}$ and $\mathbf{s}$, even though there is no correlation between $\mathbf{y}$ and $\mathbf{s}$ because these correlations are made to cancel by the data balancing algorithm. This complication is out of scope for this motivating example.) Even though these edges are absent, the $\mathbf{s}$-$\mathbf{w}$ edge creates an open path between $\mathbf{s}$ and $\mathbf{x}_{\mathbf{s}^{\perp}}$. This path implies that $\mathbf{x}_{\mathbf{s}^{\perp}}$, the features that are causally influenced by $\mathbf{y}$, but not by $\mathbf{s}$, can nonetheless be correlated with $\mathbf{s}$. Any reasonable $f(\mathbf{x})$ will incorporate information from the core features $\mathbf{x}_{\mathbf{s}^{\perp}}$, so this correlation implies that $f(\mathbf{x})$ will have some dependence on $\mathbf{s}$. Thus, conditioning on $\mathbf{s}$ can change the distribution of $\mathbf{x}_{\mathbf{s}^{\perp}}$, resulting in a different conditional expectation $E[f(\mathbf{x})\mid\mathbf{s}]$ across levels of $\mathbf{s}$.

Note that this dependence path is cut off if the edge between $\mathbf{s}$ and $\mathbf{w}$ is eliminated, e.g., by balancing proxies $\mathbf{w}$ with respect to sensitive attributes $\mathbf{s}$, as done by proxies. This is the primary motivation for this approach at a population level. However, there are a number of potential complications to this argument. Chief among these is the need to balance on an exhaustive set of proxy variables to eliminate the backdoor path described above. Notably, there is no guarantee that balancing an incomplete set of proxies will reduce dependence; indeed, in some cases the backdoor dependence can increase, a phenomenon known as “bias amplification” in the causal inference literature (see, e.g., Middleton et al., 2016). In addition, this argument is made at the population level, but in practice, finite sample complications in optimization or sampling variability may dominate the benefits of proxy balancing in terms of association bias.
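
The population-level argument can be illustrated with a toy simulation. Everything in the sketch below is an assumption made for illustration: all variables are binary, the proxy $\mathbf{w}$ agrees with $\mathbf{s}$ 80% of the time, the core feature is a noisy sum of $\mathbf{y}$ and $\mathbf{w}$, and the predictor simply reads the core feature. None of these choices come from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

s = rng.integers(0, 2, size=n)                 # sensitive attribute
w = np.where(rng.random(n) < 0.8, s, 1 - s)    # proxy, correlated with s
y = rng.integers(0, 2, size=n)                 # label, independent of s and w
x_core = y + w + 0.1 * rng.standard_normal(n)  # core feature: depends on y and w, not on s
f = x_core                                     # predictor that relies on the core feature

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

corr_y_s = corr(y, s)  # close to 0: labels are balanced w.r.t. s
corr_f_s = corr(f, s)  # clearly positive: dependence leaks through the proxy

# Balance the proxy w.r.t. s by subsampling equal counts from each (s, w) cell.
cells = [np.flatnonzero((s == i) & (w == j)) for i in (0, 1) for j in (0, 1)]
k = min(len(cell) for cell in cells)
idx = np.concatenate([rng.choice(cell, size=k, replace=False) for cell in cells])
corr_f_s_balanced = corr(f[idx], s[idx])  # close to 0 after balancing w against s
```

In this toy setup the predictor is correlated with $\mathbf{s}$ even though $\mathbf{y}$ and $\mathbf{s}$ are independent, and the correlation vanishes once $\mathbf{w}$ is balanced against $\mathbf{s}$. The simulation uses a single, fully observed proxy, so it does not exhibit the complications from incomplete proxy sets discussed above.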

A.6 Model Quality Analysis

To understand the discrepancy in model quality between classification and retrieval tasks, we conduct an in-depth study of the data distribution before and after debiasing. More specifically, we analyze the distributions in the image modality (pseudo object labels, OCRs, …), the text modality (text length, word frequency, part-of-speech tags, …), and across modalities (image-text similarity). Surprisingly, debiasing does not change most distributions. The exception is the pseudo object labels (Figure 10), which are annotated by the open-vocabulary detector OWL-ViT (Minderer et al., 2022b): the frequency of the “adult” label drops by around 25% after debiasing.

Figure 10: The distribution changes of top-50 pseudo object labels before and after debiasing.

We further analyze the test sets of ImageNet-ILSVRC2012 (Deng et al., 2009) and COCO Captions (Chen et al., 2015), and find that the majority of ImageNet images are not associated with human-related labels. In contrast, about 40% of COCO images have captions mentioning humans (Table 7).

Table 7: Example COCO captions with human mentions.

| Keyword | Caption with human mentions |
|---|---|
| adult | Group of adults at outdoor gathering on clear day. |
| boy | A little boy sitting on a wooden bench eating half a sandwich. |
| bride | A bride is standing on the grass under an umbrella. |
| child | A woman playing catch with her young child. |
| children | Children are playing soccer on a field with several adults observing nearby. |
| daughter | The mom is teaching her daughter to play baseball |
| elderly | A woman and an elderly woman sitting at a desk together in a classroom setting with notebooks and pencils on the desk. |
| father | Baby girl feeding her father some pink and white birthday cake. |
| female | a female in a white shirt is playing tennis |
| gentleman | Two gentleman in formal suits, one of them is adjusting the collar of the other. |
| girl | two girls in red shirts grass and a baseball glove |
| groom | The cat grooms itself atop the workshop table. |
| lady | a lady sitting in a half shell hut |
| male | A male cook standing by the kitchen sink. |
| man | a man with a bat looks into the air at a pop up |
| men | a couple of men that are walking around on some grass |
| mom | A dog and a cat share a moment |
| mother | A mother elephant and her calf touching trunks in the tall grass |
| people | some people on some grass playing frisbee and trees |
| player | A black and white photo of baseball players looking for a ball |
| sister | Two sisters watch as their kids decorate a cake |
| son | Mom gives her daughter a lesson in using her baseball glove. |
| teen | Three teenagers are throwing a frisbee back and forth in a field. |
| woman | A woman on a tennis court serves the ball. |
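
The text does not state how human mentions were detected. A few rows of Table 7 (e.g. “mom” paired with a caption containing “moment”, or “son” with “lesson”) are consistent with plain substring matching, so the sketch below contrasts that assumed rule with whole-word matching. The keyword subset and both helper functions are hypothetical.

```python
import re

# A small subset of the Table 7 keywords, chosen for illustration.
KEYWORDS = ["mom", "son", "groom", "woman"]

def substring_hits(caption, keywords=KEYWORDS):
    """Keywords that occur anywhere in the caption, even inside longer words."""
    text = caption.lower()
    return [k for k in keywords if k in text]

def whole_word_hits(caption, keywords=KEYWORDS):
    """Keywords that occur as standalone words in the caption."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return [k for k in keywords if k in tokens]

caption = "A dog and a cat share a moment"
loose = substring_hits(caption)    # flags "mom" because of "moment"
strict = whole_word_hits(caption)  # flags nothing
```

Under substring matching, captions without any human mention can be counted as human-related, which would slightly inflate estimates such as the share of COCO captions mentioning humans.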

To verify the hypothesis that the model quality change is primarily caused by the human distribution shift in the training data, we randomly drop examples (instead of relying on the debiasing algorithm) in which humans occur in either image or text. As shown in Figure 11, dropping more human examples leads to higher classification but lower image-text retrieval numbers, which perfectly reproduces the debiasing effect on the model quality.

Figure 11: The percentage of human training examples (mentioning human in image or text) affects the model performance on both classification (INet 10-shot and 0-shot) and retrieval (COCO and Flickr 0-shot) tasks. All numbers are the mean values of 3 runs.

Furthermore, we neutralize the human mentions in text (e.g. by replacing “female” / “gentlemen” / “she” with “person” / “people” / “the person”), in order to preserve training examples (i.e. avoid dropping human examples) and mitigate the side effect of debiasing on model quality. Figure 12 demonstrates the feasibility of such an approach, which we leave for future work.

Figure 12: Human neutralization minimizes the influence on model quality.
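
A minimal sketch of such a neutralization step is given below. It only covers the three replacements quoted in the text; a real mapping would need many more entries, and the dictionary, regular expression, and function name are all assumptions made for illustration.

```python
import re

# Replacements quoted in the text; a fuller vocabulary would be needed in practice.
NEUTRAL = {"female": "person", "gentlemen": "people", "she": "the person"}

# Match whole words only, so that e.g. "shell" or "shirt" are left untouched.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NEUTRAL)) + r")\b", flags=re.IGNORECASE
)

def neutralize(caption):
    """Replace gendered mentions with neutral ones (capitalization is not preserved)."""
    return _PATTERN.sub(lambda m: NEUTRAL[m.group(0).lower()], caption)

example = neutralize("a female in a white shirt is playing tennis")
```

Because the captions are rewritten rather than removed, the number of training examples that involve humans stays the same.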

A.7 Additional Figures

A.7.1 Representation Bias

Figures 13 and 14 report the median scores (as opposed to the means) for the analysis of Section 4.1.

Figure 13: Median parity $\mathbb{E}[p(\mathrm{man})-p(\mathrm{woman})]$ across images from the ILSVRC2012 dataset (Deng et al., 2009). See Figure 2. Values closer to zero are better.

Figure 14: Similar to Figure 3 but using the median scores, instead of the mean.

A.7.2 Association Bias

Figures 15 and 16 report the median association bias scores (as opposed to the means) for the analysis of Section 4.2.

Figure 15: Reporting the median bias scores instead of the means for association bias. See Figure 4 and Section 4.2.

Figure 16: A summary of how CLIP learns or unlearns association bias (shown in $y$-axis) when intervened data comprises different percentages [%] of training duration. Here, we report the median bias scores across occupations instead of the mean, which was reported in Figure 5.

A.7.3 Model Quality

Figures 17, 18, and 19 report the full model quality evaluation results in terms of few-shot classification, zero-shot classification, and retrieval.

Figure 17: Full 10-shot evaluation results across the nine datasets in Table 1.

Figure 18: Full zero-shot evaluation results across the three datasets in Figure 6.

Figure 19: Full retrieval results across the two datasets in Figure 6.

A.7.4 Encoding Sensitive Attributes

Figure 20 reports the full evaluation results of predicting sensitive attributes. See Section 4.3 for details.

Figure 20: Full evaluation results for predicting sensitive attributes in FairFace, UTKFace, and MIAP, disaggregated by perceived gender. The $y$-axis shows the mean log-loss (lower is better). See Section 4.3 for details.