Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models

Salman, Shaeke; Shams, Md Montasir Bin; Liu, Xiuwen; Zhu, Lingjiong

Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.08473 (cs)

[Submitted on 13 Feb 2024]



Abstract: Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
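The abstract does not spell out the optimization procedure, so the following is only a toy sketch of the general idea of matching embeddings by gradient descent, not the authors' method. It assumes a purely linear "encoder" `W` (a hypothetical stand-in for the vision model, in the spirit of the linear approximation the abstract mentions) and random vectors in place of images. The function name `match_embedding` and all sizes are illustrative assumptions.

```python
import numpy as np

def match_embedding(W, x_source, target_embedding, steps=200):
    """Gradient descent on 0.5 * ||W x - target_embedding||^2, starting at x_source."""
    # Step size 1/L, where L is the largest eigenvalue of W^T W.
    lr = 1.0 / np.linalg.norm(W, 2) ** 2
    x = x_source.copy()
    for _ in range(steps):
        residual = W @ x - target_embedding
        x -= lr * (W.T @ residual)  # gradient of the squared-error loss
    return x

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical sizes: the input space is much larger than the embedding space.
rng = np.random.default_rng(0)
n_pixels, n_embed = 1024, 16
W = rng.standard_normal((n_embed, n_pixels))   # toy linear encoder (assumption)
x_source = rng.standard_normal(n_pixels)       # stands in for a source image
x_target = rng.standard_normal(n_pixels)       # stands in for a target image

x_modified = match_embedding(W, x_source, W @ x_target)

# How closely the modified input's embedding matches the target's embedding.
embedding_similarity = cosine(W @ x_modified, W @ x_target)
# How far the input moved, relative to the source-target distance.
relative_change = (np.linalg.norm(x_modified - x_source)
                   / np.linalg.norm(x_target - x_source))

print(f"embedding cosine similarity: {embedding_similarity:.4f}")
print(f"relative input change:       {relative_change:.4f}")
```

In this toy setting the update only moves the input along the row space of `W`, which is small compared with the input space. The modified input therefore stays much closer to the source than to the target while its embedding matches the target's. Whether and how this carries over to a real vision-language transformer is what the paper investigates; the sketch makes no claim about those results.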

Comments:	30 pages, 30 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.08473 [cs.CV]
	(or arXiv:2402.08473v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.08473

Submission history

From: Shaeke Salman
[v1] Tue, 13 Feb 2024 14:07:49 UTC (46,265 KB)
