Search | arXiv e-print repository

Updating CLIP to Prefer Descriptions Over Captions

Authors: Amir Zur, Elisa Kreiss, Karel D'Oosterlinck, Christopher Potts, Atticus Geiger

Abstract: Although CLIPScore is a powerful generic metric that captures the similarity between a text and an image, it fails to distinguish between a caption that is meant to complement the information in an image and a description that is meant to replace an image entirely, e.g., for accessibility. We address this shortcoming by updating the CLIP model with the Concadia dataset to assign higher scores to d… ▽ More Although CLIPScore is a powerful generic metric that captures the similarity between a text and an image, it fails to distinguish between a caption that is meant to complement the information in an image and a description that is meant to replace an image entirely, e.g., for accessibility. We address this shortcoming by updating the CLIP model with the Concadia dataset to assign higher scores to descriptions than captions using parameter efficient fine-tuning and a loss objective derived from work on causal interpretability. This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities and has interpretable structure that sheds light on the caption--description distinction. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2402.15002 [pdf, other]

CommVQA: Situating Visual Question Answering in Communicative Contexts

Authors: Nandita Shankar Naik, Christopher Potts, Elisa Kreiss

Abstract: Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask are dependent on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descripti… ▽ More Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask are dependent on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario. We show that CommVQA poses a challenge for current models. Providing contextual information to VQA models improves performance broadly, highlighting the relevance of situating systems within a communicative scenario. △ Less

Submitted 22 February, 2024; originally announced February 2024.

arXiv:2309.11710 [pdf, other]

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

Authors: Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber

Abstract: Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two compo… ▽ More Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence. △ Less

Submitted 20 September, 2023; originally announced September 2023.

arXiv:2307.15745 [pdf, other]

Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering

Authors: Nandita Naik, Christopher Potts, Elisa Kreiss

Abstract: Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way, allowing people who cannot see images to ask questions about them. However, multiple studies have shown that people who are blind or have low-vision prefer image explanations that incorporate the context in which an image appears, yet current VQA datasets focus on images in isolation. We a… ▽ More Visual question answering (VQA) has the potential to make the Internet more accessible in an interactive way, allowing people who cannot see images to ask questions about them. However, multiple studies have shown that people who are blind or have low-vision prefer image explanations that incorporate the context in which an image appears, yet current VQA datasets focus on images in isolation. We argue that VQA models will not fully succeed at meeting people's needs unless they take context into account. To further motivate and analyze the distinction between different contexts, we introduce Context-VQA, a VQA dataset that pairs images with contexts, specifically types of websites (e.g., a shop** website). We find that the types of questions vary systematically across contexts. For example, images presented in a travel context garner 2 times more "Where?" questions, and images on social media and news garner 2.8 and 1.8 times more "Who?" questions than the average. We also find that context effects are especially important when participants can't see the image. These results demonstrate that context affects the types of questions asked and that VQA models should be context-sensitive to better meet people's needs, especially in accessibility settings. △ Less

Submitted 30 August, 2023; v1 submitted 28 July, 2023; originally announced July 2023.

Comments: Proceedings of ICCV 2023 Workshop on Closing the Loop Between Vision and Language

arXiv:2305.09038 [pdf]

Characterizing Image Accessibility on Wikipedia across Languages

Authors: Elisa Kreiss, Krishna Srinivasan, Tiziano Piccardi, Jesus Adolfo Hermosillo, Cynthia Bennett, Michael S. Bernstein, Meredith Ringel Morris, Christopher Potts

Abstract: We make a first attempt to characterize image accessibility on Wikipedia across languages, present new experimental results that can inform efforts to assess description quality, and offer some strategies to improve Wikipedia's image accessibility. We make a first attempt to characterize image accessibility on Wikipedia across languages, present new experimental results that can inform efforts to assess description quality, and offer some strategies to improve Wikipedia's image accessibility. △ Less

Submitted 15 May, 2023; originally announced May 2023.

Comments: Presented at Wiki Workshop 2023

arXiv:2205.10646 [pdf, other]

Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics

Authors: Elisa Kreiss, Cynthia Bennett, Shayan Hooshmand, Eric Zelikman, Meredith Ringel Morris, Christopher Potts

Abstract: Few images on the Web receive alt-text descriptions that would make them accessible to blind and low vision (BLV) users. Image-based NLG systems have progressed to the point where they can begin to address this persistent societal problem, but these systems will not be fully successful unless we evaluate them on metrics that guide their development correctly. Here, we argue against current referen… ▽ More Few images on the Web receive alt-text descriptions that would make them accessible to blind and low vision (BLV) users. Image-based NLG systems have progressed to the point where they can begin to address this persistent societal problem, but these systems will not be fully successful unless we evaluate them on metrics that guide their development correctly. Here, we argue against current referenceless metrics -- those that don't rely on human-generated ground-truth descriptions -- on the grounds that they do not align with the needs of BLV users. The fundamental shortcoming of these metrics is that they do not take context into account, whereas contextual information is highly valued by BLV users. To substantiate these claims, we present a study with BLV participants who rated descriptions along a variety of dimensions. An in-depth analysis reveals that the lack of context-awareness makes current referenceless metrics inadequate for advancing image accessibility. As a proof-of-concept, we provide a contextual version of the referenceless metric CLIPScore which begins to address the disconnect to the BLV data. An accessible HTML version of this paper is available at https://elisakreiss.github.io/contextual-description-evaluation/paper/reflessmetrics.html △ Less

Submitted 27 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

Comments: Proceedings of EMNLP 2022

arXiv:2205.09172 [pdf, other]

Color Overmodification Emerges from Data-Driven Learning and Pragmatic Reasoning

Authors: Fei Fang, Kunal Sinha, Noah D. Goodman, Christopher Potts, Elisa Kreiss

Abstract: Speakers' referential expressions often depart from communicative ideals in ways that help illuminate the nature of pragmatic language use. Patterns of overmodification, in which a speaker uses a modifier that is redundant given their communicative goal, have proven especially informative in this regard. It seems likely that these patterns are shaped by the environment a speaker is exposed to in c… ▽ More Speakers' referential expressions often depart from communicative ideals in ways that help illuminate the nature of pragmatic language use. Patterns of overmodification, in which a speaker uses a modifier that is redundant given their communicative goal, have proven especially informative in this regard. It seems likely that these patterns are shaped by the environment a speaker is exposed to in complex ways. Unfortunately, systematically manipulating these factors during human language acquisition is impossible. In this paper, we propose to address this limitation by adopting neural networks (NN) as learning agents. By systematically varying the environments in which these agents are trained, while kee** the NN architecture constant, we show that overmodification is more likely with environmental features that are infrequent or salient. We show that these findings emerge naturally in the context of a probabilistic model of pragmatic communication. △ Less

Submitted 18 May, 2022; originally announced May 2022.

Comments: Proceedings of the Annual Meeting of the Cognitive Science Society (2022)

arXiv:2112.02505 [pdf, other]

Causal Distillation for Language Models

Authors: Zhengxuan Wu, Atticus Geiger, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman

Abstract: Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In thi… ▽ More Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training(IIT). IIT pushes the student model to become a causal abstraction of the teacher model - a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition). △ Less

Submitted 3 June, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 7 pages, 2 figures

Journal ref: NAACL 2022

arXiv:2112.00826 [pdf, other]

Inducing Causal Structure for Interpretable Neural Networks

Authors: Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, Christopher Potts

Abstract: In many areas, we have well-founded insights about causal structure that would be useful to bring into our trained models while still allowing them to learn in a data-driven fashion. To achieve this, we present the new method of interchange intervention training (IIT). In IIT, we (1) align variables in a causal model (e.g., a deterministic program or Bayesian network) with representations in a neu… ▽ More In many areas, we have well-founded insights about causal structure that would be useful to bring into our trained models while still allowing them to learn in a data-driven fashion. To achieve this, we present the new method of interchange intervention training (IIT). In IIT, we (1) align variables in a causal model (e.g., a deterministic program or Bayesian network) with representations in a neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input when aligned representations in both models are set to be the value they would be for a source input. IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model when its loss is zero. We evaluate IIT on a structural vision task (MNIST-PVR), a navigational language task (ReaSCAN), and a natural language inference task (MQNLI). We compare IIT against multi-task training objectives and data augmentation. In all our experiments, IIT achieves the best results and produces neural models that are more interpretable in the sense that they more successfully realize the target causal model. △ Less

Submitted 20 July, 2022; v1 submitted 1 December, 2021; originally announced December 2021.

arXiv:2109.08994 [pdf, other]

ReaSCAN: Compositional Reasoning in Language Grounding

Authors: Zhengxuan Wu, Elisa Kreiss, Desmond C. Ong, Christopher Potts

Abstract: The ability to compositionally map language to referents, relations, and actions is an essential component of language understanding. The recent gSCAN dataset (Ruis et al. 2020, NeurIPS) is an inspiring attempt to assess the capacity of models to learn this kind of grounding in scenarios involving navigational instructions. However, we show that gSCAN's highly constrained design means that it does… ▽ More The ability to compositionally map language to referents, relations, and actions is an essential component of language understanding. The recent gSCAN dataset (Ruis et al. 2020, NeurIPS) is an inspiring attempt to assess the capacity of models to learn this kind of grounding in scenarios involving navigational instructions. However, we show that gSCAN's highly constrained design means that it does not require compositional interpretation and that many details of its instructions and scenarios are not required for task success. To address these limitations, we propose ReaSCAN, a benchmark dataset that builds off gSCAN but requires compositional language interpretation and reasoning about entities and relations. We assess two models on ReaSCAN: a multi-modal baseline and a state-of-the-art graph convolutional neural model. These experiments show that ReaSCAN is substantially harder than gSCAN for both neural architectures. This suggests that ReaSCAN can serve as a valuable benchmark for advancing our understanding of models' compositional generalization and reasoning capabilities. △ Less

Submitted 18 September, 2021; originally announced September 2021.

Comments: 26 pages, 8 figures

Journal ref: NeurIPS 2021 (Dataset and Benchmarks Track)

arXiv:2104.08376 [pdf, other]

Concadia: Towards Image-Based Text Generation with a Purpose

Authors: Elisa Kreiss, Fei Fang, Noah D. Goodman, Christopher Potts

Abstract: Current deep learning models often achieve excellent results on benchmark image-to-text datasets but fail to generate texts that are useful in practice. We argue that to close this gap, it is vital to distinguish descriptions from captions based on their distinct communicative roles. Descriptions focus on visual features and are meant to replace an image (often to increase accessibility), whereas… ▽ More Current deep learning models often achieve excellent results on benchmark image-to-text datasets but fail to generate texts that are useful in practice. We argue that to close this gap, it is vital to distinguish descriptions from captions based on their distinct communicative roles. Descriptions focus on visual features and are meant to replace an image (often to increase accessibility), whereas captions appear alongside an image to supply additional information. To motivate this distinction and help people put it into practice, we introduce the publicly available Wikipedia-based dataset Concadia consisting of 96,918 images with corresponding English-language descriptions, captions, and surrounding context. Using insights from Concadia, models trained on it, and a preregistered human-subjects experiment with human- and model-generated texts, we characterize the commonalities and differences between descriptions and captions. In addition, we show that, for generating both descriptions and captions, it is useful to augment image-to-text models with representations of the textual context in which the image appeared. △ Less

Submitted 27 October, 2022; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Proceedings of EMNLP 2022

arXiv:2006.09589 [pdf, other]

Modeling Subjective Assessments of Guilt in Newspaper Crime Narratives

Authors: Elisa Kreiss, Zijian Wang, Christopher Potts

Abstract: Crime reporting is a prevalent form of journalism with the power to shape public perceptions and social policies. How does the language of these reports act on readers? We seek to address this question with the SuspectGuilt Corpus of annotated crime stories from English-language newspapers in the U.S. For SuspectGuilt, annotators read short crime articles and provided text-level ratings concerning… ▽ More Crime reporting is a prevalent form of journalism with the power to shape public perceptions and social policies. How does the language of these reports act on readers? We seek to address this question with the SuspectGuilt Corpus of annotated crime stories from English-language newspapers in the U.S. For SuspectGuilt, annotators read short crime articles and provided text-level ratings concerning the guilt of the main suspect as well as span-level annotations indicating which parts of the story they felt most influenced their ratings. SuspectGuilt thus provides a rich picture of how linguistic choices affect subjective guilt judgments. In addition, we use SuspectGuilt to train and assess predictive models, and show that these models benefit from genre pretraining and joint supervision from the text-level ratings and span-level annotations. Such models might be used as tools for understanding the societal effects of crime reporting. △ Less

Submitted 14 October, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

Comments: CoNLL 2020

arXiv:1903.08237 [pdf, other]

When redundancy is useful: A Bayesian approach to 'overinformative' referring expressions

Authors: Judith Degen, Robert D. Hawkins, Caroline Graf, Elisa Kreiss, Noah D. Goodman

Abstract: Referring is one of the most basic and prevalent uses of language. How do speakers choose from the wealth of referring expressions at their disposal? Rational theories of language use have come under attack for decades for not being able to account for the seemingly irrational overinformativeness ubiquitous in referring expressions. Here we present a novel production model of referring expressions… ▽ More Referring is one of the most basic and prevalent uses of language. How do speakers choose from the wealth of referring expressions at their disposal? Rational theories of language use have come under attack for decades for not being able to account for the seemingly irrational overinformativeness ubiquitous in referring expressions. Here we present a novel production model of referring expressions within the Rational Speech Act framework that treats speakers as agents that rationally trade off cost and informativeness of utterances. Crucially, we relax the assumption that informativeness is computed with respect to a deterministic Boolean semantics, in favor of a non-deterministic continuous semantics. This innovation allows us to capture a large number of seemingly disparate phenomena within one unified framework: the basic asymmetry in speakers' propensity to overmodify with color rather than size; the increase in overmodification in complex scenes; the increase in overmodification with atypical features; and the increase in specificity in nominal reference as a function of typicality. These findings cast a new light on the production of referring expressions: rather than being wastefully overinformative, reference is usefully redundant. △ Less

Submitted 10 December, 2019; v1 submitted 19 March, 2019; originally announced March 2019.

Showing 1–13 of 13 results for author: Kreiss, E