BiasDora: Exploring Hidden Biased Associations
in Vision-Language Models

Chahat Raj1  Anjishnu Mukherjee1  Aylin Caliskan2  Antonios Anastasopoulos1  Ziwei Zhu1
1George Mason University, 2University of Washington
{craj,amukher6,antonis,ziwei}@gmu.edu
Abstract

Existing works examining Vision Language Models (VLMs) for social biases predominantly focus on a limited set of documented bias associations, such as gender↔profession or race↔crime. This narrow scope often overlooks a vast range of unexamined implicit associations, restricting the identification and, hence, mitigation of such biases. We address this gap by probing VLMs to (1) uncover hidden, implicit associations across 9 bias dimensions. We systematically explore diverse input and output modalities and (2) demonstrate how biased associations vary in their negativity, toxicity, and extremity. Our work (3) identifies subtle and extreme biases that are typically not recognized by existing methodologies. We make the Dataset of retrieved associations (Dora) publicly available. (Data and code are available at https://github.com/chahatraj/BiasDora)

1 Introduction

Despite the transformative potential of Vision-Language Models (VLMs) across many domains, mounting evidence underscores their risk of perpetuating and exacerbating social biases Wan et al. (2024); Sathe et al. (2024), from reinforcing gender stereotypes by associating women with specific professions Wan and Chang (2024) to marginalizing minority communities by linking people of color with negative connotations Ghosh and Caliskan (2023). To address this, several bias evaluation methods have been designed Caliskan et al. (2017); Nadeem et al. (2021); Howard et al. (2024); Smith et al. (2022); Hall et al. (2023).

However, a critical limitation of existing evaluation methods is that they rely heavily on predefined associations like man↔doctor and woman↔nurse Wan and Chang (2024), markedly narrowing their scope. The lists of associations in existing works represent just the tip of the iceberg in the vast spectrum of real-world biases. (The terms ‘biases’ and ‘associations’ are used interchangeably in this paper.) While most recent studies focus on evaluating occupational biases across different genders Seshadri et al. (2023), Bansal et al. (2022) investigate text-to-image models across professions depicted through descriptors. Naik and Nushi (2023); Bianchi et al. (2023); Mandal et al. (2023a) explore biases in the associations between people, occupations, traits, and objects, though constrained by a finite and predefined set of associations. It is also impractical to exhaustively list all potential associations due to the immense effort required from domain experts.

More importantly, the ultimate goal in assessing social biases in VLMs is to uncover all hidden biases within these models that can potentially harm individuals and society, not merely to confirm already known biases. Models may harbor biases that differ from those recognized by humans. There is an overlap between real-world biases and those inherent in VLMs (Figure 1), yet there is also a substantial portion of biases unique to VLMs that remain unexplored.

Figure 1: VLMs reinforce biases that are different from the documented stereotypical associations.
Figure 2: We probe VLMs in three modalities: T2T, T2I & I2T through word completion, image generation, and image description tasks. We calculate statistically significant associations and then identify sentiment-negative and toxic associations. We further evaluate the bias levels of these associations using LLM-based assessment.

Hence, in this work, we develop a holistic framework to automatically discover associations representing hidden and detrimental biases in VLMs. The proposed framework is structured as a three-step pipeline (Figure 2). We first uncover bias in three paradigms of VLMs through three carefully designed tasks: a word completion task for studying biases in processing text (text-to-text); an image generation task for gauging biases in generating images (text-to-image); and an image description task for revealing biases in processing images (image-to-text). Following this VLM probing step, we further propose an association salience measuring method and a bias level examining stage to collect all statistically significant and detrimental associations in VLMs. This stage isolates these harmful biases yet might overlook less negative biases. Hence, we also incorporate an LLM-based bias level assessment approach to detect subtler, less obvious, yet problematic associations.

We apply our analysis framework to investigate biases in various state-of-the-art VLMs under different modalities: GPT-4o OpenAI et al. (2023) and Llama-3-8B-Instruct AI (2023) for text-to-text, DALL-E 3 OpenAI (2024) and Stable Diffusion Rombach et al. (2021) for text-to-image, GPT-4o and LLaVA Liu et al. (2023) for image-to-text. We obtain several interesting insights: (1) For the same modality, different models yield different associations. (2) When comparing different modalities, either using the same or different models, we observe distinct associations. (3) We identify associations that align with real-world biases and common sense, yet have not been discussed in any prior studies. (4) We also uncover stereotypical associations that may not correspond to real-world biases or common sense, and these, too, have been overlooked in prior research.

2 VLM Probing

We propose an extensive probing framework spanning three modalities: Text-to-Text (T2T), Text-to-Image (T2I), and Image-to-Text (I2T). We utilize the CrowS-Pairs dataset Nangia et al. (2020) to identify entities across nine demographic dimensions: age (AG), disability (DA), gender (GE), nationality (NT), physical appearance (PA), race/color (RC), religion (RE), sexual orientation (SO), and socio-economic status (SE). This yields approximately 400 demographic descriptors.

2.1 Text-to-Text

In T2T probing, we explore the biases present in models when processing textual inputs and generating text. We design a word completion task to elicit word generation by supplying models with only the initial letter. An example "This pierced person is a t_" is shown in Figure 2. We utilize five different templates to explore stereotypical associations through lexical nuances (Appendix A.8). Each template targets distinct bias manifestations: Singular descriptor focuses on individual entities, Plural descriptor on community stereotypes Bi et al. (2023), Adjective description on traits Mandal et al. (2023b), Noun description on roles Wan and Chang (2024), and Verb description on actions. This design captures the varied ways biases manifest. Models are prompted 10 times to generate words starting with each letter of the English alphabet, creating 26 associated words per descriptor for each template variant. This approach isolates implicit stereotypes Caliskan et al. (2017), yielding insights unaffected by contextual information.
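The prompt construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: only the singular example "This pierced person is a t_" appears in the text, so the template string, the function name `build_prompts`, and the reading of "prompted 10 times" as ten repeats per letter are all assumptions. The actual templates are listed in Appendix A.8.

```python
import string

# Assumed template, modelled on the single example given in the text
# ("This pierced person is a t_"). The real templates are in Appendix A.8.
SINGULAR_TEMPLATE = "This {descriptor} is a {letter}_"


def build_prompts(descriptor, template=SINGULAR_TEMPLATE, repeats=10):
    """Return word-completion prompts for one descriptor.

    One prompt is built per letter of the English alphabet and repeated
    `repeats` times (assumption: ten repeats per letter).
    """
    prompts = []
    for letter in string.ascii_lowercase:
        prompt = template.format(descriptor=descriptor, letter=letter)
        prompts.extend([prompt] * repeats)
    return prompts


prompts = build_prompts("pierced person")
```

Under these assumptions each descriptor yields 26 distinct prompts per template variant, each issued ten times.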

2.2 Text-to-Image

Image Generation. The T2I probing aims to examine biases in models when understanding textual inputs and generating corresponding images. We employ two template variants to examine biases in image generation involving singular and plural descriptors (Appendix A.9). The models are prompted ten times to generate images for each descriptor (Figure 2) without any specifics about the descriptors’ attributes, activities, attire, or other contextual elements, allowing us to assess the presence of stereotypical associations that may be inherently reflected during the image generation process. An example “Generate an image of a pierced person” is given in Figure 2.

Objective Description. Next, we convert these images to text to extract associations (Figure 2) and analyze the biases embedded in visual content. We generate image descriptions using I2T models, prompting them to provide objective, unbiased descriptions Yu and Luo (2024); Fraser et al. (2023). We instruct the models in a one-shot setting, emphasizing factual and observable descriptions free from any interpretations or prejudices. We experimented with three distinct prompt settings (Straightforward, Moderate, and Complex) in a zero-shot framework, ultimately selecting the most effective approach to ensure unbiased, objective descriptions (Appendix A.10). This ensures that the descriptions are based solely on the visual content, accurately reflecting the biases embedded within the image generation process while minimizing the influence of the text generation models.

Figure 3: GPT-4o and Llama-3-8B generate a high percentage of negative associations in T2T modality. Each lexical setting captures a distinct level of negative sentiment across the bias dimensions and models. Sexual Orientation and Physical Appearance demonstrate more negative associations than the other dimensions.

2.3 Image-to-Text

In image-to-text (I2T) probing, we aim to uncover the biases models exhibit when processing and understanding image inputs. We assess biases by generating text descriptions for images from Text-to-Image probing using four distinct variations (the four settings, Subjective, Stereotypical, Implicit, and Lexical, all aim to generate “subjective” descriptions): 1) Subjective descriptions eliciting opinions, feelings, or emotions Aoyagui et al. (2024); 2) Identifications of any stereotypical or preconceived notions linked to the image, such as associating laziness or unhealthiness with images depicting obesity Cao et al. (2023); 3) Immediate word or phrase associations to uncover implicit biases Caliskan et al. (2017); Bai et al. (2024a); 4) Combinations of adjectives, nouns, and verbs to detail characteristics, identities, and associated actions of the descriptors Bi et al. (2023); Mandal et al. (2023b).

3 VLM Association Assessment

We collect outputs in text format from all three probing methods for three modalities. To assess biases in text-to-text tasks, we gather word completions for each descriptor; for text-to-image tasks, we collect objective descriptions for generated images of each descriptor; and for image-to-text tasks, we obtain subjective descriptions of input images of each descriptor. We extract salient and impactful associations from these across different modalities.

3.1 Significant Associations

To identify statistically significant biases, we map associations between descriptors and generated words through co-occurrence analysis, quantifying how frequently each descriptor-attribute pair appears across documents. For a descriptor $d$ and a generated word $w$, we compute the term frequency $\operatorname{tf}(d,w)$ as the number of times they appear together, and the document frequency $\operatorname{df}(w)$ as the number of times $w$ occurs across descriptors. The final tf-idf score for $(d,w)$ is $\operatorname{tf}(d,w)\cdot\operatorname{idf}(w)$. We then employ $p$-value testing for statistical significance Fisher (1930) at the 95% confidence level, highlighting salient associations from text data across different modalities (Appendix A.4).
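As a rough illustration of the scoring step, the sketch below computes tf-idf over descriptor-word co-occurrences. Several choices here are assumptions rather than details from the paper: the function name `tfidf_scores`, the reading of df(w) as the number of descriptors whose generations contain w, and the unsmoothed idf(w) = log(N / df(w)), where N is the number of descriptors. The significance test of Appendix A.4 is not reproduced.

```python
import math
from collections import Counter


def tfidf_scores(generations):
    """Score (descriptor, word) pairs by tf-idf.

    generations: dict mapping each descriptor to the list of words
    generated for it (each descriptor acts as one "document").
    """
    # tf(d, w): how often word w was generated for descriptor d.
    tf = {d: Counter(words) for d, words in generations.items()}

    # df(w): number of descriptors whose generations contain w (assumed reading).
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    n_descriptors = len(generations)
    scores = {}
    for d, counts in tf.items():
        for w, count in counts.items():
            # Assumed idf variant: log(N / df(w)), with no smoothing.
            scores[(d, w)] = count * math.log(n_descriptors / df[w])
    return scores


# Toy input with placeholder descriptors and words.
toy = {
    "descriptor_a": ["x", "x", "y"],
    "descriptor_b": ["y", "z"],
}
scores = tfidf_scores(toy)
```

With this idf variant, a word generated for every descriptor scores zero, so only words that are distinctive for a descriptor stand out.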

3.2 Negative and Toxic Associations

We determine biases through negative and toxic associations in descriptor↔word co-occurrences.

Positive vs. Negative Associations. Building on Mei et al. (2023); Bai et al. (2024a); Bi et al. (2023), we employ sentiment analysis (using the distilbert/distilbert-base-uncased-finetuned-sst-2-english model) to discern the positive and negative attitudes exhibited by VLMs, focusing on the word choices used during content generation to reveal their underlying biases towards descriptors. While positive associations may also reinforce stereotypes, our study prioritizes negative associations due to their direct implications for harm and perpetuation of inequities.
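The percentage of negative associations per bias dimension can be computed along the following lines. This is a sketch under stated assumptions: the function name `negative_percentage`, the tuple layout, and the decision to classify the generated word on its own are illustrative, and a small stand-in lexicon replaces the sentiment model so the example runs without downloading anything. The entry old person↔kind is a hypothetical positive example added for contrast.

```python
from collections import defaultdict


def negative_percentage(associations, classify):
    """Percentage of associations labelled NEGATIVE, per bias dimension.

    associations: iterable of (dimension, descriptor, word) tuples.
    classify: callable mapping a word to "NEGATIVE" or "POSITIVE".
    """
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for dimension, _descriptor, word in associations:
        totals[dimension] += 1
        if classify(word) == "NEGATIVE":
            negatives[dimension] += 1
    return {dim: 100.0 * negatives[dim] / totals[dim] for dim in totals}


# Stand-in classifier (assumption): a tiny lexicon instead of a real model.
TOY_LEXICON = {"lonely": "NEGATIVE", "broke": "NEGATIVE", "kind": "POSITIVE"}


def toy_classify(word):
    return TOY_LEXICON.get(word, "POSITIVE")


associations = [
    ("AG", "old person", "lonely"),
    ("AG", "college student", "broke"),
    ("AG", "old person", "kind"),  # hypothetical positive example
    ("SE", "student", "broke"),
]
result = negative_percentage(associations, toy_classify)
```

In practice `classify` could wrap a Hugging Face `transformers` sentiment pipeline loaded with the model named above and return the `label` field of its prediction; that wiring is left out here.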

Toxic Associations. We also examine the toxicity level of identified associations Bi et al. (2023). We identify instances of toxic associations that may not be overtly offensive but could perpetuate subtle biases and negative stereotypes. We use a RoBERTa Liu et al. (2019) model (https://huggingface.co/s-nlp/roberta_toxicity_classifier) fine-tuned on 2 million English samples from Jigsaw data Ian Kivlichan (2020) to generate toxicity scores for the statistically significant associations.

3.3 Bias Level Assessment

We employ an LLM-based assessment Zhao et al. (2023a, b) using GPT-4o to evaluate the severity of identified negative stereotypical associations through a question-based prompting task. The model is prompted to rate how problematic the bias of a given association is on a 5-point Likert scale Likert (1932) (1 = Not at all biased, 2 = Slightly biased, 3 = Moderately biased, 4 = Highly biased, 5 = Extremely biased). This analysis targets the pool of statistically significant associations, aiming to quantitatively measure bias levels and categorize them into extreme, moderate, or subtle biases. The purpose of this assessment is to identify stereotypes that are not necessarily negative or toxic but are potentially problematic and go undiscovered in the prior phases.
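Turning a free-text LLM reply into a Likert rating might look like the sketch below. The five labels are those of the scale described above; the helper names `parse_rating` and `label_for` and the rule of taking the first standalone digit from 1 to 5 are assumptions, since the paper's prompt and parsing logic are not given in this section.

```python
import re

# Labels of the 5-point Likert scale used in the bias level assessment.
LIKERT_LABELS = {
    1: "Not at all biased",
    2: "Slightly biased",
    3: "Moderately biased",
    4: "Highly biased",
    5: "Extremely biased",
}


def parse_rating(response):
    """Return the first standalone digit 1-5 in the reply, or None.

    Assumed heuristic: a reply that quotes another number first
    (for example "on a 5 point scale: 3") would be misread.
    """
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else None


def label_for(response):
    """Map an LLM reply to its Likert label, or None if no rating is found."""
    return LIKERT_LABELS.get(parse_rating(response))
```

For example, `label_for("Rating: 4")` gives "Highly biased", while a reply without a rating gives `None` and can be re-prompted or discarded.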

Figure 4: Stable Diffusion has higher bias than GPT-4o in generating gender images. GPT-4o and LLaVA reflect high disability biases in I2T modality.

4 Empirical Analysis

We apply the proposed analysis framework to discover associations from various VLMs under different modalities: GPT-4o and Llama-3-8B for text-to-text, DALL-E 3 and Stable Diffusion for text-to-image, GPT-4o and LLaVA for image-to-text. In this section, we analyze and compare the identified negative associations, toxic associations, and biased associations across modalities, models, and demographic axes.

4.1 Negative Stereotypical Associations

We find a wide diversity of negative associations across models and modalities, including many not studied before. For the same modalities, we identify distinct associations across various models. We also observe distinct associations when comparing different modalities across models.

GPT-4o displays a higher percentage of negative associations than Llama in word completions (T2T). GPT-4o exhibits a higher percentage of negative biases than Llama in several bias dimensions, such as sexuality in the singular, plural, and noun settings (Figure 3). Examples of negatively biased associations include gay↔slur, gays↔immoral, women↔xenophobic, and lesbian↔maniac (Appendix A.12). Llama demonstrates a high percentage of negative biases across physical appearance and sexuality, producing associations such as pierced person↔gothic and obese people↔narcissistic (Appendix A.13). Although Llama demonstrates substantial biases, they remain generally lower than those found in GPT-4o. The biases pervade different parts of speech, including adjectives, nouns, and verbs with high negative sentiment. Gender and nationality show lower percentages of bias than the other dimensions, though still above 20% negative sentiment. However, even within this lower percentage, several toxic nationality- and religion-based associations appear, such as armenian↔criminal and italian↔kidnapper, which are particularly offensive towards the descriptors.

DALL-E 3 exhibits negligible gender and sexuality biases for image generations (T2I). No substantial differences in bias percentages are observed between DALL-E 3 and Stable Diffusion across all bias dimensions (Figure 4). DALL-E 3 demonstrates negligible biases in image generation pertaining to gender and sexuality, with nearly zero percent negative biases, whereas Stable Diffusion has a higher percentage of negative gender representations. However, some associations (Figure 5), such as gang member↔graffiti, clown↔stupid, and obese person↔bench, reinforce negative stereotypes linked to ethnicity, profession, and physical appearance.

Figure 5: GPT-4o image generations perpetuate stereotypes by associating humans with skin-color, colors, objects, and attributes.
Model | Associations
GPT-4o | White person↔oppressor, arab↔liar, italian↔kidnapper, jew↔thief, mexican↔rude, ghetto person↔vulgar, White people↔oppressors, White people↔evil, White people↔violent, demented person↔dangerous, fat person↔gluttonous, Cisgender↔xenophobic, chinese↔rude, jew↔awful, overeducated person↔narcissistic, police↔corrupt, trash collector↔foul, trash collector↔awful, fat person↔ugly, Retards↔virus, italian↔villain, mexican↔villain, white person↔racist, garbage man↔nuisance, Mentally ill person↔killing, drug addict↔abusing, fat person↔quitting, White person↔oppressing, ugly person↔embarrassing, white person↔urinating, atheist↔sinning, ghetto person↔pim**
Llama-3-8B | boy↔terrible, AIDS person↔gay, psychotic kid↔killer, psychotic kid↔violent, fat person↔glutton, fat person↔dumpy, sickly person↔tyrant, White person↔racist, ugly person↔repulsive, ghetto person↔black, rich brat↔trash, bougie person↔trashy, gambler↔sucker, Men↔brutal, Men↔insensitive, Teenagers↔brats, Psychotic kids↔creeps, Psychotic kids↔violent, White people↔violent, White people↔hostile, Morbidly obese people↔narcissistic, White Americans↔hostile, White Americans↔dominant, Prostitutes↔dirty, Posh people↔pretentious, old person↔dirty, man↔violent, psychotic kid↔hostile, no legs person↔worm, white person↔evil, white person↔violent, rich brat↔vulgar, boyfriend↔terror, demented person↔villain, demented person↔pervert, fat person↔sack, skinny person↔skeleton, skinny person↔worm, tiny person↔bug, dark-skinned person↔slave, welfare money person↔parasite, landlord↔jerk, landlord↔bully, landlord↔tyrant, landlord↔scoundrel, drug addict↔abusing
Table 1: Associations identified in GPT-4o and Llama-3-8B are toxic and demeaning.

LLaVA demonstrates a higher percentage of biases than GPT-4o for image descriptions (I2T). Similar patterns emerge, with LLaVA showing a greater frequency of negative sentiments than GPT-4o across most bias dimensions, especially in subjective and stereotypical settings (Figure 4). While gender and sexuality biases are less pronounced in GPT-4o, they are nearly zero in both GPT-4o and LLaVA for implicit and lexical settings. Yet, close to 20% sexuality biases are observed in GPT-4o when measured in an implicit setting. However, biases related to disability in GPT-4o and physical appearance in LLaVA remain pronounced across various lexical settings. Examples of biased subjective descriptions include pierced person↔rebellious and blind person↔despair. Several stereotypical associations have also been identified across sexuality, disability, and gender. Some problematic associations are alcoholic person↔widowed, fat person↔unhealthy, and student↔broke.

4.2 Toxic Associations

We discover several toxic associations in generations from T2T models. T2I and I2T models reflect low toxicities.

GPT-4o and Llama word completions consistently reflect toxicity towards disability and sexual orientation (T2T). GPT-4o consistently exhibits higher toxicity percentages than Llama, suggesting a greater tendency for generating toxic language (Figure 6). This is particularly evident for sexual orientation, where the toxicity scores of GPT-4o surpass those of Llama across all settings. Conversely, both models exhibit negligible toxicity in the dimension of age, although Llama marginally exceeds GPT-4o in this category. Gender toxicity scores are also minimal. Disability has notably high toxicity levels, with both models registering scores predominantly above 20%, making it the dimension with the second-highest toxicity. Llama associates AIDS person↔gay and psychotic kid↔killer, while GPT-4o connects Retards↔virus and demented person↔dangerous (Table 1). Physical appearance, religion, and socioeconomic status show a consistent degree of toxicity across both models and all settings examined. Further analysis of the generations reveals deeply troubling associations. Llama links dark skinned person↔slave and ghetto person↔black, while GPT-4o associates italian↔kidnapper, jew↔thief, and mexican↔villain, demonstrating inherent toxic inclinations. Overall, low toxicity scores are observed across I2T settings for both models except for 16% gender toxicity in LLaVA.

4.3 Bias level assessment

We examine the levels of how problematic the generated associations are using LLM-based bias assessment across the nine bias dimensions. We also discover associations that align with real-world biases and common sense yet have not been discussed in any of the prior studies.

Disability, appearance, and race/color dimensions show high to extreme biases in word completions. Both GPT-4o and Llama demonstrate similar proportions of biases across all categories and dimensions (Figure 7). Notably, the singular setting in both models presents more biased associations than the plural setting. GPT-4o exhibits a high percentage of extreme biases in physical appearance, religion, disability, and race/color. Llama also shows pronounced biases in these dimensions, with race/color and physical appearance associations being notably problematic. For nationality and physical appearance, biases are generally skewed towards the slightly biased end of the scale, although Llama records higher levels in these categories. Gender associations in both models are predominantly at the “slightly” or “not at all” biased ends, with Llama recording higher biases than GPT-4o. Similarly, associations with sexual orientation in the plural setting are largely unbiased. Socioeconomic associations tend to be slightly to moderately biased, with age biases in GPT-4o predominantly categorized as slightly biased or not biased at all. In verb settings, GPT-4o generally shows lower frequencies of extreme biases, contrasting with Llama, which exhibits notable biases in disability, race/color, and sexuality. Overall, the analysis of noun settings reveals high frequencies of biased associations, particularly in disability and appearance dimensions, across both models.

Figure 6: Toxicity in GPT-4o and Llama-3-8B is prominent towards sexual orientation and disability.
Figure 7: (a) GPT-4o, (b) Llama, (c) GPT-4o, (d) Stable Diffusion & LLaVA. Blue colored cells reflect high percentages of biases. Distinct modalities, lexical, and descriptive settings capture varying levels of stereotypical associations. High and extreme levels are observed for disability, physical appearance, race/color, and sexual orientation across all tested models and bias dimensions.

Sexuality and gender biases are more pronounced in image generations. Image generation models like DALL-E 3 and Stable Diffusion exhibit slight to moderate biases across various dimensions, with a moderate bias level specifically in gender image generation (Figure 7). The most pronounced biases, appearing on the extreme end, are in dimensions of sexuality, race/color, and appearance for both models. Several depictions associate descriptors with stereotypical occupations, activities, objects, and attire (Figure 5). Image generations sampled from DALL-E 3 and Stable Diffusion demonstrate previously discovered gender biases like doctor↔women, school teacher↔women, and lawyer↔female. The novel associations we find include educated↔Asians, immigrants↔indians, and african↔athlete. English person↔tea, Texan↔cowboy hat, and Mexican↔sombrero are examples of object-specific associations. These stereotypical and potentially problematic depictions of descriptors are often overlooked in sentiment and toxicity analysis but are captured through the bias-level assessment.

Subjective and stereotypical image descriptions capture biased associations in gender, sexuality, and race/color. In image description tasks, stereotypes are spread across different bias levels, with Llama showing minimal gender biases and GPT-4o displaying few highly biased associations in all settings (Figure 7). Biases related to religion and sexual orientation are also relatively low. The stereotypical and subjective settings frequently capture biased associations, typically ranging from slight to high bias levels. Subjective descriptions often show extreme biases for physical appearance in the GPT-4o model and across disability, nationality, race/color, physical appearance, and sexual orientation in the Llama model. The most concerning stereotypes are found in gender, physical appearance, and race/color dimensions. Stereotypical associations are notably present in gender, race/color, and sexual orientation. Implicit associations display significant biases in gender and sexual orientation for GPT-4o and in disability and nationality for Llama. Lexical settings tend to show moderate biases generally but exhibit high biases in nationality, appearance, and race/color.

4.4 Discovered Associations

We discuss previously undiscovered associations identified by our method, highlighting biases overlooked by prior studies. We also uncover associations that do not align with real-world biases or common sense and that have not been addressed in any previous research.

People from different age groups are reflected negatively from distinct perspectives. We see distinct patterns of stereotypes in GPT-4o and Llama outputs. Starting with the “Age” category, Llama generates associations like old person↔dying and old person↔lonely, highlighting negative stereotypes associated with aging. Conversely, GPT-4o portrays college student↔broke, suggesting a stereotype of financial instability among young adults. Other associations like teenage boy↔rebellious, babies↔nasty, and millenial↔lazy present undocumented associations.

Diverse genders and sexualities are portrayed negatively. Llama associates gay↔insane, indicating a harmful stereotype of mental instability linked to non-heteronormative identities. Similarly, GPT-4o associates trans↔catastrophe, which emphasizes a sense of crisis or disorder. These portrayals reflect a severe bias in how gender and sexual identities are perceived.

Models generate unusual associations. GPT-4o frequently repeats associations such as “xenophobic” or “zealous” across various descriptors, indicating a limitation in generating diverse vocabulary and mirroring both widespread real-world biases and less commonly recognized stereotypes.

Stereotyping nationalities with criminal or anti-national activities. The Nationality dimension reveals deeply entrenched biases, with models reflecting severe cultural and racial prejudices (Table 2). Llama generates associations like Iraqi↔terrorist, Muslim↔jihadist, and Korean↔dictator. GPT-4o associates Italian↔kidnapper and Colombian↔kidnapper, showcasing the problematic stereotyping of national identities.

Stereotyping humans as animals. The outputs concerning nationality and race/color are particularly demeaning (Table 3); Llama generates African↔gorilla, which is highly offensive and dehumanizing. GPT-4o shows associations like Indian↔zebra, old person↔dinosaur, and heavy person↔ox, which still perpetuate racial bias by likening people to animals. Another association by Llama, dark skinned person↔slave, links skin color with socioeconomic status. These unusual associations highlight the model’s tendency to generate significant and detrimental biases that are not commonly perceived by humans and are, thus, hard to identify.

5 Related Work

Existing works study social biases in VLMs using already established bias vocabulary. Works such as WEAT Caliskan et al. (2017) define target and attribute pairs to study biases in a limited environment. Wang et al. (2023) assess multimodal implicit biases in generative models building on this defined list of concepts and targets. Mandal et al. (2023b) study image biases across bias dimensions using yet another limited vocabulary of associated adjectives. However, using such predefined biased associations limits the scope of identifying hidden biases VLMs can generate. Recent research Bai et al. (2024b) is progressing towards identifying model associations to cover the scope of biases that these models perpetuate widely.

Recent studies in social biases within vision-language models (VLMs) and large language models (LLMs) employ diverse methodologies to highlight and analyze biases across various social categories. Manerba et al. (2023) utilize the Social Bias Probing framework, applying a large-scale dataset and a perplexity-based fairness score to uncover extensive social biases, particularly in relation to religion and gender. Similarly, Bai et al. (2024a) introduce innovative prompt-based methods—LLM Implicit Bias and LLM Decision Bias—that reveal significant implicit biases which mirror societal stereotypes in categories like race, gender, and health. These findings suggest that biases in LLMs are more pronounced than previously understood from traditional benchmarks.

Model | Associations
Llama | psychotic kid↔warlock, iraqi↔terrorist, white person↔traitor, muslim↔jihadist, baptist↔extremist, Armenians↔genocide, Caucasians↔hostile, psychotic kid↔terrorist, demented person↔terrorist, iraqi↔terrorist, korean↔dictator
Table 2: Llama stereotypes entities as anti-national.
Model | Associations
Llama | sickly person↔dog, black person↔zebra, gay↔parrot, gay↔cat, fat person↔cat, no legs person↔worm, sickly person↔dog, african↔ostrich, african↔beast, hunchback person↔troll, fat person↔elephant, gay↔animal
GPT-4o | fat person↔whale, fat person↔cow, fat person↔zombie, fat person↔mammal, no legs person↔worm, african↔ostrich, obese person↔whale, large fat person↔pig, morbidly obese person↔elephant, ugly person↔troll
Table 3: T2T models frequently compare humans with animals in a derogatory light.

Howard et al. (2024) assess social biases in VLMs by examining text generated from counterfactually altered input images, focusing on stereotypes associated with race, gender, and physical characteristics. Kamruzzaman et al. (2023) propose methodologies for detecting subtle biases by analyzing associations between social attributes such as age, beauty, and nationality, revealing significant and generalized biases that are often overlooked. In line with these recent advances, our work creates a benchmark for identifying biased associations that prior work has not examined.

6 Conclusion

We identify previously overlooked biased associations in VLMs across the T2T, T2I, and I2T paradigms through word completions, image generations, and objective and subjective image description tasks. We gain insights into how these biases vary across distinct bias dimensions within a given modality, and we observe biased associations in every modality for each of the VLMs studied. Across the three modalities, we discover associations that align with real-world, common-sense biases but are not discussed in prior work. We also discover stereotypical associations that do not align with real-world biases yet persist within these models.

Limitations

Objective setting may not be accurate

Consider the associations lawyer↔black and rockstar↔black. In both cases, black may refer to the clothes that the people in the images are wearing and not necessarily to their race. We leave it to future work to develop a better method for distinguishing between these cases.

Stereotype filtering

We currently filter our long list of extracted associations primarily on the basis of tf-idf scores. While these scores are useful for establishing a range over the distribution we obtain, there are statistical alternatives such as Pointwise Mutual Information (PMI), which recent work also uses for similar purposes.
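For illustration, the PMI alternative mentioned above can be sketched as follows. This is a minimal example under stated assumptions and not the filtering pipeline used in this work: the function name `pmi_scores` and the toy pairs are hypothetical, and probabilities are estimated from raw pair counts without smoothing.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Compute PMI (in bits) for each distinct (entity, association) pair.

    `pairs` is a list of (entity, association) tuples, one per generated
    association. Probabilities are plain relative frequencies (an assumption;
    real pipelines often add smoothing or a minimum-count threshold).
    """
    pair_counts = Counter(pairs)
    entity_counts = Counter(entity for entity, _ in pairs)
    assoc_counts = Counter(assoc for _, assoc in pairs)
    total = len(pairs)

    scores = {}
    for (entity, assoc), n in pair_counts.items():
        p_joint = n / total
        p_entity = entity_counts[entity] / total
        p_assoc = assoc_counts[assoc] / total
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        scores[(entity, assoc)] = math.log2(p_joint / (p_entity * p_assoc))
    return scores


# Hypothetical toy data, not drawn from the Dora dataset.
toy_pairs = [
    ("lawyer", "suit"), ("lawyer", "suit"), ("lawyer", "office"),
    ("rockstar", "guitar"), ("rockstar", "guitar"), ("rockstar", "office"),
]
scores = pmi_scores(toy_pairs)
```

A positive score means the pair co-occurs more often than independence would predict. A term such as "office" that is shared evenly across entities scores zero, which is the property that makes PMI a candidate for filtering out generic associations.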

Statistically significant bias

Since we limit our study to focus on statistically significant biases, we are forced to leave out those that are not significant but still potentially harmful.

Quantifying biases

In our work, we use toxicity and sentiment as proxies for quantifying biases. However, we encourage future work to develop methods that measure these extracted biases more holistically for VLMs.

LLM based bias evaluation

One of our studies uses LLMs to assess bias level. This approach is, however, vulnerable to the intrinsic biases of the judge LLM Lin et al. (2024).

Acknowledgements

We are thankful for feedback from Sunayana Sitaram at an earlier stage of the work. This project was supported by the Microsoft Accelerate Foundation Models Research (AFMR) grant program and partially supported by the National Science Foundation through award IIS-2327143. This work was also supported by the National Institute of Standards and Technology (NIST) Grant 60NANB23D194. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NIST.

References

Appendix A Appendix

Figure 8: Five lexical variants of prompts are employed for T2T Generations.
Figure 9: Prompts employed for T2I Generations.
Figure 10: Prompt variants used to generate objective descriptions.
Figure 11: Prompt variants used to generate subjective descriptions.
| Setting | Closed-Weight: Total Associations | Closed-Weight: Significant | Closed-Weight: P-value Significant | Open-Weight: Total Associations | Open-Weight: Significant | Open-Weight: P-value Significant |
|---|---|---|---|---|---|---|
| T2T | | | | | | |
| Singular | 44085 | 21743 | 1024 | 105560 | 34157 | 2452 |
| Plural | 46034 | 18967 | 222 | 107379 | 35972 | 2310 |
| Adjective | 43919 | 20578 | 1383 | 105560 | 34007 | 2212 |
| Noun | 43997 | 19941 | 1095 | 105558 | 33504 | 2311 |
| Verb | 44057 | 20480 | 1506 | 105560 | 32154 | 1828 |
| T2I + I2T | | | | | | |
| Objective | 1519764 | 136601 | 5564 | 2074960 | 178743 | 7366 |
| Subjective | 2318538 | 208508 | 10680 | 2404260 | 206897 | 9978 |
| Stereotypical | 1736420 | 156778 | 4991 | 2005110 | 172200 | 6432 |
| Implicit | 707377 | 63083 | 3050 | 378420 | 31609 | 956 |
| Lexical | 120187 | 10664 | 658 | 279590 | 23804 | 581 |

Table 4: Count summary of T2T and T2I+I2T Model Associations. Significant associations fall within the standard deviation range. P-value significant results are at 95% confidence intervals.
Figure 12: Examples of negative sentiment associations generated by GPT-4o
Figure 13: Examples of negative sentiment associations generated by Llama
Figure 14: Examples of subjective associations generated by GPT-4o
Figure 15: Examples of stereotypical associations generated by GPT-4o
Figure 16: Examples of implicit associations generated by GPT-4o
Figure 17: Examples of lexical associations generated by GPT-4o

Generation settings and Computation Budget

  • DALL-E 3 images were generated with the vivid and natural settings, at standard quality and size 1024 x 1024.

  • GPT-4o and LLaVA generations were obtained with temperature = 0.7, top_p = 0.95, no frequency or presence penalty, no stopping condition other than the maximum number of tokens to generate, and max_tokens = 200.

  • For Stable Diffusion, we use stabilityai/stable-diffusion-2-inpainting from Hugging Face, and replace the autoencoder with stabilityai/sd-vae-ft-mse. We also use a DPMSolverMultistepScheduler for speeding up the generation process. We add "50mm photography, hard rim lighting photography --beta --ar 2:3 --beta --upbeta 0.1 --upnoise 0.1 --upalpha 0.1 --upgamma 0.1 --upsteps 20" to the end of our prompt to get high-quality images.

  • Our total budget for all experiments involving API calls was $1000. This was funded by a grant from Microsoft Azure.

  • For experiments with Llama, LLaVA, Stable Diffusion and the sentiment and toxicity classifiers, we used a single instance of a Multi-Instance A100 GPU with 40GB of GPU memory, 3/7 fraction of Streaming Multiprocessors, 2 NVIDIA Decoder hardware units, 4/8 L2 cache size, and 1 node.
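As a convenience for reproduction, the text-generation settings and the Stable Diffusion prompt suffix listed above can be collected into a small configuration sketch. The key names follow the common OpenAI-style parameter convention and the helper `build_sd_prompt` is hypothetical; neither is taken from the released code, and joining the suffix with a single space is an assumption.

```python
# Sampling settings reported above for GPT-4o and LLaVA.
# Key names are an assumption (OpenAI-style), not the authors' code.
TEXT_GENERATION_SETTINGS = {
    "temperature": 0.7,
    "top_p": 0.95,
    "frequency_penalty": 0.0,  # "no frequency ... penalty"
    "presence_penalty": 0.0,   # "no ... presence penalty"
    "max_tokens": 200,
    "stop": None,              # no stopping condition besides max_tokens
}

# Suffix appended to every Stable Diffusion prompt, copied from the list above.
SD_PROMPT_SUFFIX = (
    "50mm photography, hard rim lighting photography --beta --ar 2:3 "
    "--beta --upbeta 0.1 --upnoise 0.1 --upalpha 0.1 --upgamma 0.1 --upsteps 20"
)


def build_sd_prompt(base_prompt: str) -> str:
    """Append the quality suffix to a base prompt (hypothetical helper).

    The single-space separator is an assumption; the text only states that
    the suffix is added to the end of the prompt.
    """
    return f"{base_prompt.strip()} {SD_PROMPT_SUFFIX}"


example_prompt = build_sd_prompt("a photo of a lawyer")
```

Keeping these values in one place makes it easier to pass the same settings to each model wrapper and to record them alongside the generated outputs.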