Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Brendan Park†, Madeline Janecek†, Naser Ezzati-Jivan, Yifeng Li, and Ali Emami
Brock University, St. Catharines, Ontario, Canada
{bp18ul, mj17th, nezzatijivan, yli2, aemami}@brocku.ca
† Equal contribution.

Abstract

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models’ ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, showing minimal improvement from past iterations and only marginally surpassing random guessing. Further error analysis identifies important areas for future research aimed at advancing text-to-image models in their ability to interpret and interact with the complex visual world.

1 Introduction

The interpretation of ambiguous constructs in language is crucial for assessing common-sense reasoning, with the Winograd Schema Challenge (WSC) Levesque et al. (2011); Winograd (1972) significantly influencing the evaluation of natural language understanding models. Advances in transformer-based architectures have led Large Language Models (LLMs) to achieve impressive results on WSC-based tasks, approaching human performance Brown et al. (2020); Sakaguchi et al. (2020); Kocijan et al. (2023).

Figure 1: A representative output from Stable Diffusion 2.0 on a WinoVis instance. The Diffusion Attentive Attribution Maps (DAAM) clarify the model’s focus for different terms and the correctness of its interpretation: correctly identifying ‘bee’ and ‘flower’ but erroneously associating ‘it’ with the bee instead of the flower.

Extending common-sense reasoning into multimodal domains, especially for disambiguation tasks, remains a persistent challenge. Although models like Google’s Imagen Saharia et al. (2022), OpenAI’s DALL-E 2 Ramesh et al. (2022), and Stability AI’s recently open-sourced Stable Diffusion Rombach et al. (2022) can create visually compelling images from text, their interpretability, which is essential for deciphering the models’ reasoning processes, is notably limited Tang et al. (2023). This gap restricts the development of tools that produce visuals matching complex texts, reducing model effectiveness in areas like education and digital media, where text-image integration is essential Dehouche and Dehouche (2023); Hattori and Takahara (2023).

Our response to this challenge is WinoVis, a dataset aimed at probing text-to-image models’ common-sense reasoning capabilities through pronoun disambiguation within multimodal scenarios. WinoVis not only tests models’ ability to distinguish entities within the generated images, but also examines how these models associate pronouns with the correct referents, a nuanced aspect of common-sense reasoning that has been overlooked. As depicted in the WinoVis example in Figure 1, while newer Stable Diffusion models can accurately separate entities within an image, they fail to correctly associate the pronoun ‘it’ with the intended referent, revealing the subtleties and potential gaps in multimodal common-sense reasoning.

The development of WinoVis leveraged the generative power of GPT-4 OpenAI (2023); Gilardi et al. (2023), using a methodical approach to create and refine prompts that elicit common-sense reasoning visually. This process included a complete manual review to ensure each scenario’s clarity and relevance for the disambiguation task. Moreover, we introduce a novel evaluation framework that separates models’ pronoun disambiguation proficiency from their handling of other visual processing challenges, such as susceptibility to typographic attacks Goh et al. (2021) and semantic entanglement Wu et al. (2023).

Our contributions are summarized as follows:

• WSC-Adapted Multimodal Dataset (WinoVis): A dataset of 500 scenarios for benchmarking text-to-image models’ pronoun disambiguation abilities within a visual context. (The dataset is available at https://github.com/bpark2/WinoVis.)

• Novel Evaluation Framework for Multimodal Disambiguation: Metrics and methods designed to isolate pronoun resolution from other visual processing challenges, advancing the understanding of models’ common-sense reasoning.

• Insight into Stable Diffusion’s Common-Sense Reasoning: A critical analysis revealing that even state-of-the-art models like Stable Diffusion 2.0 fall significantly short of human-level performance.

2 Background

2.1 Latent Diffusion in Image Generation

Latent diffusion models (LDMs) represent a class of generative models designed to synthesize images by progressively refining random noise. A prominent example is Stable Diffusion Rombach et al. (2022), a text-to-image LDM optimized to generate images from textual prompts. Stable Diffusion integrates three primary components: a deep language model that extracts semantic embeddings from textual prompts; an encoder-decoder architecture for encoding images into latent space representations and decoding them back; and a neural network that is responsible for mean-prediction Ho et al. (2020) (denoted as $\mu_{\theta}(\boldsymbol{z},\boldsymbol{y},t)$ ), noise-prediction Ho et al. (2020) (denoted as $\epsilon_{\theta}(\boldsymbol{z},\boldsymbol{y},t)$ ), or score-prediction Song and Ermon (2019) (denoted as $s_{\theta}(\boldsymbol{z},\boldsymbol{y},t)$ ). This network is trained on image and text pairs $\boldsymbol{x}$ and $\boldsymbol{y}$ . During training, which aims to maximize the evidence lower bound (ELBO) Sohl-Dickstein et al. (2015), the image is initially encoded to $\boldsymbol{z}_{0}$ , marking the start of the forward diffusion process, formalized as:

$$\begin{aligned}
p(\boldsymbol{z}_{t}\mid\boldsymbol{z}_{t-1}) &= \mathcal{N}\big(\boldsymbol{z}_{t}\mid\sqrt{\alpha_{t}}\,\boldsymbol{z}_{t-1},(1-\alpha_{t})\boldsymbol{I}\big)\\
&= \mathcal{N}\big(\boldsymbol{z}_{t}\mid\sqrt{1-\beta_{t}}\,\boldsymbol{z}_{t-1},\beta_{t}\boldsymbol{I}\big),
\end{aligned}$$

where $\boldsymbol{z}_{t}$ denotes the latent variable at step $t$ , with $\beta_{t}=1-\alpha_{t}$ as the noise schedule hyperparameter, and $\boldsymbol{I}$ the identity matrix. The U-Net architecture Ronneberger et al. (2015), used for denoising, iteratively reverses the diffusion through:

$$\begin{aligned}
p(\boldsymbol{z}_{t-1}\mid\boldsymbol{z}_{t}) &= \mathcal{N}\big(\boldsymbol{z}_{t-1}\mid\mu_{\theta}(\boldsymbol{z}_{t},\boldsymbol{y},t),\sigma_{t}^{2}\boldsymbol{I}\big)\\
&= \mathcal{N}\Big(\boldsymbol{z}_{t-1}\mid\frac{\boldsymbol{z}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(\boldsymbol{z}_{t},\boldsymbol{y},t)}{\sqrt{\alpha_{t}}},\sigma_{t}^{2}\boldsymbol{I}\Big),
\end{aligned}$$

with $\sigma_{t}^{2}$ as the reverse-process noise variance and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ the cumulative product of the noise schedule. Cross-attention in the U-Net layers aligns $\boldsymbol{z}_{t}$ with $\boldsymbol{y}$ . For conditional generation, the process starts with Gaussian noise $\boldsymbol{z}_{T}$ , conditioned on text $\boldsymbol{y}$ , and refines it through reverse diffusion, resembling Langevin dynamics Welling and Teh (2011). For instance, using the score function, we have

$$\boldsymbol{z}_{t-1}^{(j)}=\boldsymbol{z}_{t-1}^{(j-1)}+\frac{\alpha_{t}}{2}s_{\theta}\big(\boldsymbol{z}_{t-1}^{(j-1)},\boldsymbol{y},t\big)+\sqrt{\alpha_{t}}\,\varepsilon_{j},$$

where $j=1,\ldots,J$ , $J$ is the number of Langevin steps, $\boldsymbol{z}_{t-1}^{(0)}=\boldsymbol{z}_{t}$ , $\boldsymbol{z}_{t-1}=\boldsymbol{z}_{t-1}^{(J)}$ , and $\varepsilon_{j}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ . The denoised $\boldsymbol{z}_{0}$ generates the final image, such as the one exemplified in Figure 2 given a WinoVis instance.
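As an illustration of the forward and reverse updates above, the following is a minimal sketch in plain Python rather than an excerpt from any Stable Diffusion implementation. The function names (`forward_step`, `reverse_mean`, `reverse_step`) are hypothetical, the latent is assumed to be a flat list of floats, and the noise prediction $\epsilon_{\theta}$ is passed in as an argument instead of being computed by a text-conditioned U-Net.

```python
import math
import random


def forward_step(z_prev, beta_t, rng):
    """Sample z_t ~ N(sqrt(1 - beta_t) * z_{t-1}, beta_t * I).

    Returns the noised latent together with the noise that was drawn.
    """
    scale = math.sqrt(1.0 - beta_t)
    std = math.sqrt(beta_t)
    eps = [rng.gauss(0.0, 1.0) for _ in z_prev]
    z_t = [scale * z + std * e for z, e in zip(z_prev, eps)]
    return z_t, eps


def reverse_mean(z_t, eps_hat, beta_t, alpha_bar_t):
    """Mean of p(z_{t-1} | z_t) computed from a noise prediction eps_hat.

    alpha_bar_t is assumed to be the cumulative product of alpha_s up to t.
    """
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(z - coef * e) / math.sqrt(alpha_t) for z, e in zip(z_t, eps_hat)]


def reverse_step(z_t, eps_hat, beta_t, alpha_bar_t, sigma_t, rng):
    """Sample z_{t-1} ~ N(reverse_mean, sigma_t^2 * I)."""
    mean = reverse_mean(z_t, eps_hat, beta_t, alpha_bar_t)
    return [m + sigma_t * rng.gauss(0.0, 1.0) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    z0 = [0.5, -1.0, 2.0]
    beta_1 = 0.1
    z1, eps = forward_step(z0, beta_1, rng)
    # At t = 1 the cumulative product equals alpha_1 = 1 - beta_1.
    print(reverse_mean(z1, eps, beta_1, alpha_bar_t=1.0 - beta_1))
```

In this toy setting, feeding the true noise from a single forward step at $t=1$ into `reverse_mean` recovers $\boldsymbol{z}_{0}$ up to floating-point error, which is a convenient sanity check on the mean formula. A trained model would supply only an estimate of that noise.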

2.2 Diffusion Attentive Attribution Maps

The Diffusion Attentive Attribution Map (DAAM) technique makes it possible to interpret how individual tokens in a prompt influence the image generated by Stable Diffusion models Tang et al. (2023). This approach capitalizes on the multi-head cross-attention mechanism Vaswani et al. (2023), aggregating attention scores from both the downsampling and upsampling stages of the U-Net architecture. The attention scores, denoted as $F_{t}^{(i)\downarrow}$ for downsampling and $F_{t}^{(i)\uparrow}$ for upsampling, link specific words from the prompt to image regions, signified by coordinates $(x,y)$ , across different heads ( $i$ ) and layers ( $l$ ).

To synthesize a comprehensive heatmap from these attention scores, DAAM applies a spatial normalization procedure, scaling the attention scores for the $k$ -th word to match the original image size and summing them across all attention heads ( $i$ ), layers ( $l$ ), and time steps ( $t$ ):

$$D_{k}[x,y]=\sum_{i,t,l}\left(F_{t}^{(i)\downarrow}[x,y,l,k]+F_{t}^{(i)\uparrow}[x,y,l,k]\right),$$

where $F_{t}^{(i)\downarrow}[x,y,l,k]$ and $F_{t}^{(i)\uparrow}[x,y,l,k]$ represent the bicubically upscaled attention scores for the downsampling and upsampling pathways, respectively.

DAAM can therefore offer a visual method to evaluate how Stable Diffusion performs pronoun disambiguation, by illustrating where the model concentrates its attention in relation to textual prompts. By examining these visualizations, as demonstrated in Figure 1, we can discern the model’s implicit strategies for linking pronouns with their correct referents.
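The aggregation that produces $D_{k}$ can be sketched as follows. This is a simplified illustration, not the DAAM implementation: the function names are hypothetical, the attention maps for one word are assumed to be given as a flat list of square 2D grids (one per head, layer, time step, and pathway), and nearest-neighbour upscaling is swapped in for the bicubic interpolation described above to keep the example dependency-free.

```python
def upscale_nearest(grid, size):
    """Resize a square 2D grid to size x size by nearest-neighbour sampling.

    This stands in for bicubic upscaling purely for simplicity.
    """
    n = len(grid)
    return [
        [grid[(y * n) // size][(x * n) // size] for x in range(size)]
        for y in range(size)
    ]


def aggregate_heatmap(attention_maps, size):
    """Sum the upscaled attention maps of one word into a single heatmap.

    attention_maps: list of square 2D grids at possibly different
    resolutions, covering every head, layer, time step, and pathway.
    The result is indexed as heat[y][x].
    """
    heat = [[0.0] * size for _ in range(size)]
    for grid in attention_maps:
        upscaled = upscale_nearest(grid, size)
        for y in range(size):
            for x in range(size):
                heat[y][x] += upscaled[y][x]
    return heat


if __name__ == "__main__":
    # Two toy maps for one word: a 1x1 map and a 2x2 map.
    maps = [[[1.0]], [[0.0, 1.0], [2.0, 3.0]]]
    for row in aggregate_heatmap(maps, size=4):
        print(row)
```

Because every map is brought to the same resolution before summation, coarse and fine attention layers contribute to the same pixel grid, which is what allows a single heatmap per word to be overlaid on the generated image.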