BiasDora: Exploring Hidden Biased Associations
in Vision-Language Models

Chahat Raj¹ Anjishnu Mukherjee¹ Aylin Caliskan² Antonios Anastasopoulos¹ Ziwei Zhu¹
¹George Mason University, ²University of Washington
{craj,amukher6,antonis,ziwei}@gmu.edu [email protected]

Abstract

Existing works examining Vision Language Models (VLMs) for social biases predominantly focus on a limited set of documented bias associations, such as gender $\leftrightarrow$ profession or race $\leftrightarrow$ crime. This narrow scope often overlooks a vast range of unexamined implicit associations, restricting the identification and, hence, mitigation of such biases. We address this gap by probing VLMs to (1) uncover hidden, implicit associations across 9 bias dimensions. We systematically explore diverse input and output modalities and (2) demonstrate how biased associations vary in their negativity, toxicity, and extremity. Our work (3) identifies subtle and extreme biases that are typically not recognized by existing methodologies. We make the Dataset of retrieved associations, (Dora), publicly available.¹¹1Data and code are available here https://github.com/chahatraj/BiasDora

1 Introduction

Despite the transformative potential of Vision-Language Models (VLMs) across many domains, mounting evidence underscored their risks to perpetuate and exacerbate social biases Wan et al. (2024); Sathe et al. (2024), from reinforcing gender stereotypes by associating women with specific professions Wan and Chang (2024) to marginalizing minority communities by linking people of color with negative connotations Ghosh and Caliskan (2023). Towards this, several bias evaluation methods have been designed Caliskan et al. (2017); Nadeem et al. (2021); Howard et al. (2024); Smith et al. (2022); Hall et al. (2023).

However, a critical limitation of existing evaluation methods is that they heavily rely on predefined associations like and Wan and Chang (2024), remarkably narrowing their scope. The lists of associations²²2The terms ‘biases’ and ‘associations’ are used interchangeably in this paper. in existing works represent just the tip of the iceberg in the vast spectrum of real-world biases. While most recent studies focus on evaluating occupational biases across different genders Seshadri et al. (2023), Bansal et al. (2022) investigate text-to-image models across professions depicted through descriptors. Naik and Nushi (2023); Bianchi et al. (2023); Mandal et al. (2023a) explore biases in the associations between people, occupations, traits, and objects, though constrained by a finite and predefined set of associations. It is also impractical to exhaustively list all potential associations due to the immense effort required from domain experts.

More importantly, the ultimate goal in assessing social biases in VLMs is to uncover all hidden biases within these models that can potentially harm individuals and society, not merely to confirm already known biases. Models may harbor biases that differ from those recognized by humans. There is an overlap between real-world biases and those inherent in VLMs (Figure 1), yet there is also a substantial portion of biases unique to VLMs that remain unexplored.

Refer to caption — Figure 1: VLMs reinforce biases that are different from the documented stereotypical associations.

Hence, in this work, we develop a holistic framework to automatically discover associations representing hidden and detrimental biases in VLMs. The proposed framework is structured as a three-step pipeline (Figure 2). We first uncover bias in three paradigms of VLMs through three carefully designed tasks: a word completion task for studying biases in processing text (text-to-text); an image generation task for gauging biases in generating images (text-to-image); and an image description task for revealing biases in processing images (image-to-text). Following this VLM probing step, we further propose an association salience measuring method and a bias level examining stage to collect all statistically significant and detrimental associations in VLMs. This stage isolates these harmful biases yet might overlook less negative biases. Hence, we also incorporate an LLM-based bias level assessment approach to detect subtler, less obvious, yet problematic associations.

We apply our analysis framework to investigate biases in various state-of-the-art VLMs under different modalities: GPT-4o OpenAI et al. (2023) and Llama-3-8B-Instruct AI (2023) for text-to-text, DALL-E 3 OpenAI (2024) and Stable Diffusion Rombach et al. (2021) for text-to-image, GPT-4o and LLaVA Liu et al. (2023) for image-to-text. We obtain several interesting insights: (1) for the same modality, different models yield different associations. (2) When comparing different modalities, either using the same or different models, we observe distinct associations. (3) We identify associations aligning with real-world biases and common sense, yet have not been discussed in any prior studies. (4) We also uncover stereotypical associations that may not correspond to real-world biases or common sense, and these, too, have been overlooked in prior research.

2 VLM Probing

We propose an extensive probing framework spanning three modalities: Text-to-Text (T2T), Text-to-Image (T2I), and Image-to-Text (I2T). We utilize the CrowS-Pairsdataset Nangia et al. (2020) to identify entities across 400 descriptors and nine demographic dimensions: age (AG), disability (DA), gender (GE), nationality (NT), physical appearance (PA), race/color (RC), religion (RE), sexual orientation (SO), and socio-economic status (SE). This yields approximately 400 demographic descriptors.

2.1 Text-to-Text

In T2T probing, we explore the biases present in models when processing textual inputs and generating text. We design a word completion task to elicit word generation by supplying models with only the initial letter. An example "This pierced person is a t_" is shown in Figure 2. We utilize five different templates to explore stereotypical associations through lexical nuances (Appendix A.8). Each template targets distinct bias manifestations: Singular descriptor focuses on individual entities, Plural descriptor on community stereotypes Bi et al. (2023), Adjective description on traits Mandal et al. (2023b), Noun description on roles Wan and Chang (2024), and Verb description on actions. This design captures the varied ways biases manifest. Models are prompted 10 times to generate words starting with each letter of the English alphabet, creating 26 associated words per descriptor for each template variant. This approach isolates implicit stereotypes Caliskan et al. (2017), yielding insights unaffected by contextual information.

2.2 Text-to-Image

Image Generation. The T2I probing aims to examine biases in models when understanding textual inputs and generating corresponding images. We employ two template variants to examine biases in image generation involving singular and plural descriptors (Appendix A.9). The models are prompted ten times to generate images for each descriptor (Figure 2) without any specifics about the descriptors’ attributes, activities, attire, or other contextual elements, allowing us to assess the presence of stereotypical associations that may be inherently reflected during the image generation process. An example “Generate an image of a pierced person” is given in Figure 2.

Objective Description. Next, we convert these images to text to extract associations (Figure 2) and analyze the biases embedded in visual content. We generate image descriptions using I2T models, prompting these to provide objective, unbiased descriptions Yu and Luo (2024); Fraser et al. (2023). We instruct the models in a one-shot setting, emphasizing to provide factual and observable descriptions, free from any interpretations or prejudices. We experimented with three distinct prompt settings – Straightforward, Moderate, and Complex – in a zero-shot framework, ultimately selecting the most effective approach to ensure unbiased, objective descriptions (Appendix A.10). This ensures that the descriptions are based solely on the visual content, accurately reflecting the biases embedded within the image generation process while minimizing the influence of the text generation models.

2.3 Image-to-Text

In image-to-text (I2T) probing, we aim to uncover the biases models exhibit when processing and understanding image inputs. We assess biases by generating text descriptions for images from Text-to-Image probing using four distinct variations³³3The four settings, Subjective, Stereotypical, Implicit, and Lexical are aimed to generate “subjective” descriptions.: 1) Subjective descriptions eliciting opinions, feelings, or emotions Aoyagui et al. (2024); 2) Identifications of any stereotypical or preconceived notions linked to the image, such as associating laziness or unhealthiness with images depicting obesity Cao et al. (2023); 3) Immediate word or phrase associations to uncover implicit biases Caliskan et al. (2017); Bai et al. (2024a); 4) Combinations of adjectives, nouns, and verbs to detail characteristics, identities, and associated actions of the descriptors Bi et al. (2023); Mandal et al. (2023b).

3 VLM Association Assessment

We collect outputs in text format from all three probing methods for three modalities. To assess biases in text-to-text tasks, we gather word completions for each descriptor; for text-to-image tasks, we collect objective descriptions for generated images of each descriptor; and for image-to-text tasks, we obtain subjective descriptions of input images of each descriptor. We extract salient and impactful associations from these across different modalities.

3.1 Significant Associations

To identify statistically significant biases, we map associations between descriptors and generated words through co-occurrence analysis, quantifying how frequently each descriptor-attribute pair appears across documents. For a descriptor $d$ and a generated word $w$ , we compute the term frequency $\operatorname{tf}(d,w)$ as the times they appear together, and compute the document frequency $\operatorname{df}(w)$ as the times $w$ occurs across descriptors. The final tf-idf score for $(d,w)$ is $\operatorname{tf}(d,w)*\operatorname{idf}(w)$ . We then employ the $p$ -value testing for statistical significance Fisher (1930) at 95% confidence interval, highlighting salient associations from text data across different modalities (Appendix A.4).

3.2 Negative and Toxic Associations

We determine biases through negative and toxic associations in descriptor $\leftrightarrow$ word co-occurrences.

Positve vs. Negative Associations Building on Mei et al. (2023); Bai et al. (2024a); Bi et al. (2023), we employ sentiment analysis⁴⁴4distilbert/distilbert-base-uncased-
finetuned-sst-2-english to discern the positive and negative attitudes exhibited by VLMs, focusing on the word choices used during content generation to reveal their underlying biases towards descriptors. While positive associations may also reinforce stereotypes, our study prioritizes negative associations due to their direct implications for harm and perpetuation of inequities.

Toxic Associations We also examine the toxicity level of identified associations Bi et al. (2023). We identify instances of toxic associations that may not be overtly offensive but could perpetuate subtle biases and negative stereotypes. We use a RoBERTa Liu et al. (2019) model⁵⁵5https://huggingface.co/s-nlp/roberta_toxicity_classifier fine-tuned on 2 million English samples from Jigsaw data Ian Kivlichan (2020) to generate toxicity scores for the statistically significant associations.

3.3 Bias Level Assessment

We employ an LLM-based assessment Zhao et al. (2023a, b) using GPT-4o to evaluate the severity of identified negative stereotypical associations through a question-based prompting task. The model is prompted to rate the problematic nature of bias of a given association on a 5 point Likert scale⁶⁶6Likert scale: 1 $=$ Not at all biased, 2 $=$ Slightly biased, 3 $=$ Moderately biased, 4 $=$ Highly biased, 5 $=$ Extremely biased Likert (1932). This analysis targets the pool of statistically significant associations, aiming to quantitatively measure bias levels and categorize them into extreme, moderate, or subtle biases. The purpose of this assessment is to identify not necessarily negative or toxic associations but potentially problematic stereotypes that go undiscovered in the prior phases.

4 Empirical Analysis

We apply the proposed analysis framework to discover associations from various VLMs under different modalities: GPT-4o and Llama-3-8B for text-to-text, DALL-E 3 and Stable Diffusion for text-to-image, GPT-4o and LLaVA for image-to-text. In this section, we analyze and compare the identified negative associations, toxic associations, and biased associations across modalities, models, and demographic axes.

4.1 Negative Stereotypical Associations

We find a wide diversity of negative associations across models, and modalities, including many not studied before. For the same modalities, we identify distinct associations across various models. We also observe distinct associations when comparing different modalities across models.

GPT-4o displays a higher percentage of negative associations than Llama in word completions (T2T). GPT-4o exhibits a higher percentage of negative biases than Llama in several bias dimensions, like sexuality in singular, plural, and noun settings (Figure 3). Examples of negatively biased associations include and (Appendix A.12). Llama demonstrates a high percentage of negative biases across physical appearance and sexuality. Llama mentions and (Appendix A.13). Despite Llama demonstrating substantial biases, they remain generally less than those found in GPT-4o. The biases pervade across different parts of speech, including adjectives, nouns, and verbs with high negative sentiment. Gender and nationality demonstrate the lowest percentage of biases than the rest of the dimensions, still being above 20% negative sentiment. However, within this low percentage, several nationality/religion-based toxic associations have been found like and particularly offensive towards the descriptors.

DALL-E 3 exhibits negligible gender and sexuality biases for image generations (T2I). No substantial differences in bias percentages are observed between DALL-E 3 and Stable Diffusion across all bias dimensions (Figure 4). DALL-E 3 demonstrates negligible biases in image generation pertaining to gender and sexuality, with nearly zero percent negative biases, whereas Stable Diffusion has a higher percentage of negative gender representations. However, some associations (Figure 5), such as and reinforce negative stereotypes linked to ethnicity, profession, and physical appearance.

Model	Associations
GPT-4o	White person $\leftrightarrow$ oppressor, arab $\leftrightarrow$ liar, italian $\leftrightarrow$ kidnapper, jew $\leftrightarrow$ thief, mexican $\leftrightarrow$ rude, ghetto person $\leftrightarrow$ vulgar, White people $\leftrightarrow$ oppressors, White people $\leftrightarrow$ evil, White people $\leftrightarrow$ violent, demented person $\leftrightarrow$ dangerous, fat person $\leftrightarrow$ gluttonous, Cisgender $\leftrightarrow$ xenophobic, chinese $\leftrightarrow$ rude, jew $\leftrightarrow$ awful, overeducated person $\leftrightarrow$ narcissistic, police $\leftrightarrow$ corrupt, trash collector $\leftrightarrow$ foul, trash collector $\leftrightarrow$ awful, fat person $\leftrightarrow$ ugly, Retards $\leftrightarrow$ virus, italian $\leftrightarrow$ villain, mexican $\leftrightarrow$ villain, white person $\leftrightarrow$ racist, garbage man $\leftrightarrow$ nuisance, Mentally ill person $\leftrightarrow$ killing, drug addict $\leftrightarrow$ abusing, fat person $\leftrightarrow$ quitting, White person $\leftrightarrow$ oppressing, ugly person $\leftrightarrow$ embarrassing, white person $\leftrightarrow$ urinating, atheist $\leftrightarrow$ sinning, ghetto person $\leftrightarrow$ pim**
Llama-3-8B	boy $\leftrightarrow$ terrible, AIDS person $\leftrightarrow$ gay, psychotic kid $\leftrightarrow$ killer, psychotic kid $\leftrightarrow$ violent, fat person $\leftrightarrow$ glutton, fat person $\leftrightarrow$ dumpy, sickly person $\leftrightarrow$ tyrant, White person $\leftrightarrow$ racist, ugly person $\leftrightarrow$ repulsive, ghetto person $\leftrightarrow$ black, rich brat $\leftrightarrow$ trash, bougie person $\leftrightarrow$ trashy, gambler $\leftrightarrow$ sucker, Men $\leftrightarrow$ brutal, Men $\leftrightarrow$ insensitive, Teenagers $\leftrightarrow$ brats, Psychotic kids $\leftrightarrow$ creeps, Psychotic kids $\leftrightarrow$ violent, White people $\leftrightarrow$ violent, White people $\leftrightarrow$ hostile, Morbidly obese people $\leftrightarrow$ narcissistic, White Americans $\leftrightarrow$ hostile, White Americans $\leftrightarrow$ dominant, Prostitutes $\leftrightarrow$ dirty, Posh people $\leftrightarrow$ pretentious, old person $\leftrightarrow$ dirty, man $\leftrightarrow$ violent, psychotic kid $\leftrightarrow$ hostile, no legs person $\leftrightarrow$ worm, white person $\leftrightarrow$ evil, white person $\leftrightarrow$ violent, rich brat $\leftrightarrow$ vulgar, boyfriend $\leftrightarrow$ terror, demented person $\leftrightarrow$ villain, demented person $\leftrightarrow$ pervert, fat person $\leftrightarrow$ sack, skinny person $\leftrightarrow$ skeleton, skinny person $\leftrightarrow$ worm, tiny person $\leftrightarrow$ bug, dark-skinned person $\leftrightarrow$ slave, welfare money person $\leftrightarrow$ parasite, landlord $\leftrightarrow$ jerk, landlord $\leftrightarrow$ bully, landlord $\leftrightarrow$ tyrant, landlord $\leftrightarrow$ scoundrel, drug addict $\leftrightarrow$ abusing

Table 1: Associations identified in GPT-4o and Llama-3-8B are toxic and demeaning.

LLaVA demonstrates a higher percentage of biases than GPT-4o for image descriptions (I2T). Similar patterns emerge, with LLaVA showing a greater frequency of negative sentiments than GPT-4o across most bias dimensions, especially in subjective and stereotypical settings (Figure 4). While gender and sexuality biases are less pronounced in GPT-4o, they are nearly zero in both GPT-4o and LLaVA for implicit and lexical settings. Yet, close to 20% sexuality biases are observed in GPT-4o when measured in an implicit setting. However, biases related to disability in GPT-4o and physical appearance in LLaVA remain pronounced across various lexical settings. Examples of biased subjective descriptions include and Several stereotypical associations have also been identified across sexuality, disability, and gender. Some problematic associations are and

4.2 Toxic Associations

We discover several toxic associations in generations from T2T models. T2I and I2T models reflect low toxicities.

GPT-4o and Llama word completions consistently reflect toxicity towards disability and sexual orientation (T2T). GPT-4o consistently exhibits higher toxicity percentages than Llama, suggesting a greater tendency for generating toxic language (Figure 6). This is particularly evident for sexual orientation, where the toxicity scores of GPT-4o surpass those of Llama across all settings. Conversely, both models exhibit negligible toxicity in the dimension of age, however, Llama marginally exceeds GPT-4o in this category. Gender toxicity scores are also minimal. Disability has notably high toxicity levels, with both models registering scores predominantly above 20%, marking it as the second highest dimension observing toxicity. Llama associates and while GPT connects and (Table 1). Physical appearance, religion and socioeconomic status show a consistent degree of toxicity across both models and all settings examined. Further analysis of the generations reveals deeply troubling associations. LLaMA links and while GPT associates and demonstrating inherent toxic inclinations. Overall, low toxicity scores are observed across I2T settings for both models except for 16% gender toxicity in LLaVA.

4.3 Bias level assessment

We examine the levels of how problematic the generated associations are using LLM-based bias assessment across the nine bias dimensions. We also discover associations that align with real-world biases and common sense yet have not been discussed in any of the prior studies.

Disability, appearance, and race/color dimensions note high to extreme biases in word completions. Both GPT-4o and Llama demonstrate similar proportions of biases across all categories and dimensions, (Figure 7). Notably, the singular setting in both models presents more biased associations than the plural setting. GPT-4o exhibits a high percentage of extreme biases in physical appearance, religion, disability, and race/color. Llama also shows pronounced biases in these dimensions, with race/color and physical appearance associations being notably problematic. For nationality and physical appearance, biases are generally skewed towards the slightly biased end of the scale, although Llama records higher levels in these categories. Gender associations in both models are predominantly at the “slightly” or “not at all” biased ends, with Llama recording higher biases than GPT-4o. Similarly, associations with sexual orientation in the plural setting are largely unbiased. Socioeconomic associations tend to be slight to moderately biased, with age biases in GPT-4o predominantly categorized as slightly biased or not biased at all. In verb settings, GPT-4o generally shows lower frequencies of extreme biases, contrasting with Llama, which exhibits notable biases in disability, race/color, and sexuality. Overall, the analysis of noun settings reveals high frequencies of biased associations, particularly in disability and appearance dimensions, across both models.

Sexuality and gender biases are more pronounced in image generations. Image generation models like DALL-E 3 and Stable Diffusion exhibit slight to moderate biases across various dimensions, with a moderate bias level specifically in gender image generation, Figure 7. The most pronounced biases, appearing on the extreme end, are in dimensions of sexuality, race/color, and appearance for both models. Several depictions associate descriptors with stereotypical occupations, activities, objects, and attire (Figure 5). Image generations sampled from DALL-E 3 and Stable Diffusion demonstrate previously discovered gender biases like , and . The novel associations we find include interesting associations such as and and are examples of some object-specific associations. These stereotypical and potentially problematic depictions of descriptors are often overlooked in sentiment and toxicity analysis but are captured through the bias-level assessment.

Subjective and stereotypical image descriptions capture biased associations in gender, sexuality, and race/color. In image description tasks, stereotypes are spread across different bias levels, with Llama showing minimal gender biases and GPT-4o displaying few highly biased associations in all settings, Figure 7. Biases related to religion and sexual orientation are also relatively low. The stereotypical and subjective settings frequently capture biased associations, typically ranging from slight to high bias levels. Subjective descriptions often show extreme biases for physical appearance in the GPT-4o model and across disability, nationality, race/color, physical appearance, and sexual orientation in the Llama model. The most concerning stereotypes are found in gender, physical appearance, and race/color dimensions. Stereotypical associations are notably present in gender, race/color, and sexual orientation. Implicit associations display significant biases in gender and sexual orientation for GPT-4o and in disability and nationality for Llama. Lexical settings tend to show moderate biases generally but exhibit high biases in nationality, appearance, and race/color.

4.4 Discovered Associations

We discuss previously undiscovered associations identified by our method, highlighting biases overlooked by prior studies. We also uncover associations that do not align with real-world biases or common sense and that have not been addressed in any previous research.

People from different age groups are reflected negatively from distinct perspectives. We see distinct patterns of stereotypes in GPT-4o and Llama outputs. Starting with the “Age” category, Llama generates associations like and highlighting negative stereotypes associated with aging. Conversely, GPT-4o portrays suggesting a stereotype of financial instability among young adults. Other associations like and present undocumented associations.

Diverse genders and sexualities are portrayed negatively. Llama associates indicating a harmful stereotype of mental instability linked to non-heteronormative identities. Similarly, GPT-4o associates which emphasizes a sense of crisis or disorder. These portrayals reflect a severe bias in how gender and sexual identities are perceived.

Models generate unusual associations. GPT-4o frequently repeats associations such as “xenophobic” or “zealous” across various descriptors, indicating a limitation in generating diverse vocabulary and mirroring both widespread real-world biases and less commonly recognized stereotypes.

Stereoty** nationalities with criminal or anti-national activities. The Nationality dimension reveals deeply entrenched biases, with models reflecting severe cultural and racial prejudices (Table 2). Llama generates associations like and GPT-4o associates and showcasing the problematic stereoty** of national identities.

Stereoty** humans as animals. The outputs concerning nationality and race/color are particularly demeaning (Table 3); Llama generates which is highly offensive and dehumanizing. GPT-4o shows associations like , which still perpetuate racial bias by likening people to animals. Another association by Llama, links skin color with socioeconomic status. These unusual associations highlight the model’s tendency to generate significant and detrimental biases that are not commonly perceived by humans and are, thus, hard to identify.

5 Related Work

Existing works study social biases in VLMs using already established bias vocabulary. Works such as WEAT Caliskan et al. (2017) define target and attribute pairs to study biases in a limited environment. Wang et al. (2023) assess multimodal implicit biases in generative models building on this defined list of concepts and targets. Mandal et al. (2023b) study image biases across bias dimensions using yet another limited vocabulary of associated adjectives. However, using such predefined biased associations limits the scope of identifying hidden biases VLMs can generate. Recent research Bai et al. (2024b) is progressing towards identifying model associations to cover the scope of biases that these models perpetuate widely.

Recent studies in social biases within vision-language models (VLMs) and large language models (LLMs) employ diverse methodologies to highlight and analyze biases across various social categories. Manerba et al. (2023) utilize the Social Bias Probing framework, applying a large-scale dataset and a perplexity-based fairness score to uncover extensive social biases, particularly in relation to religion and gender. Similarly, Bai et al. (2024a) introduce innovative prompt-based methods—LLM Implicit Bias and LLM Decision Bias—that reveal significant implicit biases which mirror societal stereotypes in categories like race, gender, and health. These findings suggest that biases in LLMs are more pronounced than previously understood from traditional benchmarks.

Model

Associations

Llama

psychotic kid

\leftrightarrow

warlock, iraqi

\leftrightarrow

terrorist, white person

\leftrightarrow

traitor, muslim

\leftrightarrow

jihadist, baptist

\leftrightarrow

extremist, Armenians

\leftrightarrow

genocide, Caucasians

\leftrightarrow

hostile, psychotic kid

\leftrightarrow

terrorist, demented person

\leftrightarrow

terrorist, iraqi

\leftrightarrow

terrorist, korean

\leftrightarrow

dictator

Table 2: Llama stereotypes entities as anti-national.

Model	Associations
Llama	sickly person $\leftrightarrow$ dog, black person $\leftrightarrow$ zebra, gay $\leftrightarrow$ parrot, gay $\leftrightarrow$ cat, fat person $\leftrightarrow$ cat, no legs person $\leftrightarrow$ worm, sickly person $\leftrightarrow$ dog, african $\leftrightarrow$ ostrich, african $\leftrightarrow$ beast, hunchback person $\leftrightarrow$ troll, fat person $\leftrightarrow$ elephant, gay $\leftrightarrow$ animal
GPT-4o	fat person $\leftrightarrow$ whale, fat person $\leftrightarrow$ cow, fat person $\leftrightarrow$ zombie, fat person $\leftrightarrow$ mammal, no legs person $\leftrightarrow$ worm, african $\leftrightarrow$ ostrich, obese person $\leftrightarrow$ whale, large fat person $\leftrightarrow$ pig, morbidly obese person $\leftrightarrow$ elephant, ugly person $\leftrightarrow$ troll

Table 3: T2T models frequently compare humans with animals in a derogatory light.

Howard et al. (2024) assess social biases in VLMs by examining text generated from counterfactually altered input images, focusing on stereotypes associated with race, gender, and physical characteristics. Kamruzzaman et al. (2023) propose methodologies for detecting subtle biases by analyzing associations between social attributes such as age, beauty, and nationality, revealing significant and generalized biases that are often overlooked. Moreover, Our work, in line with these recent advances creates a benchmark in identifying previously uncovered biased associations.

6 Conclusion

We identify previously overlooked biased associations in VLMs across T2T, T2I, and I2T paradigms through word completions, image generations, and objective and subjective image description tasks. We gain several insights as to how these biases vary across distinct bias dimensions for a given modality. We observe several biased associations for each modality for different VLMs. We discover several associations across three modalities that align to real-world biases following common sense that are not discussed by prior works. We also discover stereotypical associations that do not align to real-word biases, yet, perpetuate within these models.

Limitations

Objective setting may not be accurate

Let’s consider the association and For both of these, black may be referring to the clothes that the people in the images are wearing and not necessarily their race. We leave it to future work to figure out a better method to distinguish between these cases.

Stereotype filtering

We currently filter down our long list of extracted associations primarily on the basis of tf-idf scores, which while useful in figuring out a range of scores for the distribution we obtain, has statistical alternatives like Pointwise Mutual Informatoin (PMI) which recent work also uses for similar purposes.

Statistically significant bias

Since we limit our study to focus on statistically significant biases, we are forced to leave out those that are not significant but still potentially harmful.

Quantifying biases

In our work, we use toxicity and sentiment as proxies for quantification of biases. We however encourage future work to develop methods to measure these extracted biases more holistically for VLMs.

LLM based bias evaluation

One of our studies uses LLMs to asses bias level. This approach is, however, vulnerable to the biases that the judge LLM has intrinsically Lin et al. (2024).

Acknnowledgements

We are thankful for feedback from Sunayana Sitaram at an earlier stage of the work. This project was supported by the Microsoft Accelerate Foundation Models Research (AFMR) grant program and partially supported by the National Science Foundation through award IIS-2327143. This work was also supported by the National Institute of Standards and Technology (NIST) Grant 60NANB23D194. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NIST.

References

AI (2023) Meta AI. 2023. Meta llama 3: Advancing language models with state-of-the-art capabilities. Accessed: 2024-06-17.
Aoyagui et al. (2024) Paula Akemi Aoyagui, Sharon Ferguson, and Anastasia Kuzminykh. 2024. Exploring subjectivity for more human-centric assessment of social biases in large language models.
Bai et al. (2024a) Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths. 2024a. Measuring implicit bias in explicitly unbiased large language models.
Bai et al. (2024b) Yanhong Bai, Jiabao Zhao, **xin Shi, Zhentao Xie, Xingjiao Wu, and Liang He. 2024b. Fairmonitor: A dual-framework for detecting stereotypes and biases in large language models.
Bansal et al. (2022) Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. 2022. How well can text-to-image generative models understand ethical natural language interventions? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1358–1370, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Bi et al. (2023) Guanqun Bi, Lei Shen, Yuqiang Xie, Yanan Cao, Tiangang Zhu, and Xiaodong He. 2023. A group fairness lens for large language models.
Bianchi et al. (2023) Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. 2023. Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, page 1493–1504, New York, NY, USA. Association for Computing Machinery.
Caliskan et al. (2017) Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.
Cao et al. (2023) Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, and Hal Daume III. 2023. Multilingual large language models leak human stereotypes across language boundaries.
Fisher (1930) R. A. Fisher. 1930. Inverse probability. Mathematical Proceedings of the Cambridge Philosophical Society, 26:528–535.
Fraser et al. (2023) Kathleen C. Fraser, Svetlana Kiritchenko, and Isar Nejadgholi. 2023. A friendly face: Do text-to-image systems rely on stereotypes when the input is under-specified?
Ghosh and Caliskan (2023) Sourojit Ghosh and Aylin Caliskan. 2023. ’person’ == light-skinned, western man, and sexualization of women of color: Stereotypes in stable diffusion. In Conference on Empirical Methods in Natural Language Processing.
Hall et al. (2023) Siobhan Mackenzie Hall, F. Goncalves Abrantes, Hanwen Zhu, Grace A. Sodunke, Aleksandar Shtedritski, and Hannah Rose Kirk. 2023. Visogender: A dataset for benchmarking gender bias in image-text pronoun resolution. ArXiv preprint, abs/2306.12424.
Howard et al. (2024) Phillip Howard, Kathleen C. Fraser, Anahita Bhiwandiwalla, and Svetlana Kiritchenko. 2024. Uncovering bias in large vision-language models at scale with counterfactuals.
Ian Kivlichan (2020) Julia Elliott Lucy Vasserman Martin Görner Phil Culliton Ian Kivlichan, Jeffrey Sorensen. 2020. Jigsaw multilingual toxic comment classification.
Kamruzzaman et al. (2023) Mahammed Kamruzzaman, Md. Minul Islam Shovon, and Gene Louis Kim. 2023. Investigating subtler biases in llms: Ageism, beauty, institutional, and nationality bias in generative models.
Likert (1932) Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of psychology.
Lin et al. (2024) Luyang Lin, Lingzhi Wang, **song Guo, and Kam-Fai Wong. 2024. Investigating bias in llm-based bias detection: Disparities between llms and human perception.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach.
Mandal et al. (2023a) Abhishek Mandal, Susan Leavy, and Suzanne Little. 2023a. Multimodal composite association score: Measuring gender bias in generative multimodal models. ArXiv preprint, abs/2304.13855.
Mandal et al. (2023b) Abhishek Mandal, Suzanne Little, and Susan Leavy. 2023b. Gender bias in multimodal models: A transnational feminist approach considering geographical region and culture.
Manerba et al. (2023) Marta Marchiori Manerba, Karolina Stańczak, Riccardo Guidotti, and Isabelle Augenstein. 2023. Social bias probing: Fairness benchmarking for language models.
Mei et al. (2023) Katelyn Mei, Sonia Fereidooni, and Aylin Caliskan. 2023. Bias against 93 stigmatized groups in masked language models and downstream sentiment classification tasks. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23, page 1699–1710, New York, NY, USA. Association for Computing Machinery.
Nadeem et al. (2021) Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5356–5371, Online. Association for Computational Linguistics.
Naik and Nushi (2023) Ranjita Naik and Besmira Nushi. 2023. Social biases through the text-to-image generation lens. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, page 786–808, New York, NY, USA. Association for Computing Machinery.
Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
OpenAI (2024) OpenAI. 2024. Dall·e 3 technical report. https://cdn.openai.com/papers/dall-e-3.pdf. [Accessed: June 9, 2024].
OpenAI et al. (2023) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, and et al. 2023. Gpt-4 technical report.
Rombach et al. (2021) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-resolution image synthesis with latent diffusion models.
Sathe et al. (2024) Ashutosh Sathe, Prachi Jain, and Sunayana Sitaram. 2024. A unified framework and dataset for assessing gender bias in vision-language models.
Seshadri et al. (2023) Preethi Seshadri, Sameer Singh, and Yanai Elazar. 2023. The bias amplification paradox in text-to-image generation. ArXiv preprint, abs/2308.00755.
Smith et al. (2022) Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. 2022. “I’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9180–9211, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Wan and Chang (2024) Yixin Wan and Kai-Wei Chang. 2024. The male ceo and the female assistant: Probing gender biases in text-to-image models through paired stereotype test.
Wan et al. (2024) Yixin Wan, Arjun Subramonian, Anaelia Ovalle, Zongyu Lin, Ashima Suvarna, Christina Chance, Hritik Bansal, Rebecca Pattichis, and Kai-Wei Chang. 2024. Survey of bias in text-to-image generation: Definition, evaluation, and mitigation.
Wang et al. (2023) Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, and Xin Eric Wang. 2023. T2iat: Measuring valence and stereotypical biases in text-to-image generation.
Yu and Luo (2024) Yongsheng Yu and Jiebo Luo. 2024. Chain-of-thought prompting for demographic inference with large multimodal models.
Zhao et al. (2023a) Jiaxu Zhao, Meng Fang, Shirui Pan, Wenpeng Yin, and Mykola Pechenizkiy. 2023a. Gptbias: A comprehensive framework for evaluating bias in large language models.
Zhao et al. (2023b) Yachao Zhao, Bo Wang, Dongming Zhao, Kun Huang, Yan Wang, Ruifang He, and Yuexian Hou. 2023b. Mind vs. mouth: On measuring re-judge inconsistency of social bias in large language models.

Appendix A Appendix

	Closed-Weight Models			Open-Weight Models
	Total Associations	Significant	P-value Significant	Total Associations	Significant	P-value Significant
T2T
Singular	44085	21743	1024	105560	34157	2452
Plural	46034	18967	222	107379	35972	2310
Adjective	43919	20578	1383	105560	34007	2212
Noun	43997	19941	1095	105558	33504	2311
Verb	44057	20480	1506	105560	32154	1828
T2I + I2T
Objective	1519764	136601	5564	2074960	178743	7366
Subjective	2318538	208508	10680	2404260	206897	9978
Stereotypical	1736420	156778	4991	2005110	172200	6432
Implicit	707377	63083	3050	378420	31609	956
Lexical	120187	10664	658	279590	23804	581

Table 4: Count summary of T2T and T2I+I2T Model Associations. Significant associations fall within the standard deviation range. P-value significant results are at 95% confidence intervals.

Generation settings and Computation Budget

•

DALL-E 3 images were generated for vivid and natural settings for standard quality and size $1024$ x $1024$
•

GPT-4o and LLaVA generations were obtained for temperature $=0.7$ , top_p $=0.95$ , no frequency or presence penalty, no stop** condition other than the maximum number of tokens to generate, max_tokens $=200$ .
•

For Stable Diffusion, we use stabilityai/stable-diffusion-2-inpainting from Hugging Face, and replace the autoencoder with stabilityai/sd-vae-ft-mse. We also use a DPMSolverMultistepScheduler for speeding up the generation process. We add "50mm photography, hard rim lighting photography --beta --ar 2:3 --beta --upbeta 0.1 --upnoise 0.1 --upalpha 0.1 --upgamma 0.1 --upsteps 20" to the end of our prompt to get high-quality images.
•

Our total budget for all experiments involving API calls was $1000. This was funded by a grant from Microsoft Azure.
•

For experiments with Llama, LLaVA, Stable Diffusion and the sentiment and toxicity classifiers, we used a single instance of a Multi-Instance A100 GPU with 40GB of GPU memory, 3/7 fraction of Streaming Multiprocessors, 2 NVIDIA Decoder hardware units, 4/8 L2 cache size, and 1 node.

BiasDora: Exploring Hidden Biased Associations in Vision-Language Models