Exploring Transferability of Multimodal Adversarial Samples for Vision-Language Pre-training Models with Contrastive Learning

Wang, Youze; Hu, Wenbo; Dong, Yinpeng; Zhang, Hanwang; Hong, Richang

arXiv:2308.12636 (cs)

[Submitted on 24 Aug 2023 (v1), last revised 5 Nov 2023 (this version, v2)]

Abstract: Vision-language pre-training (VLP) models are vulnerable, especially to multimodal adversarial samples, which can be crafted by adding imperceptible perturbations to both the original images and texts. However, no prior work has explored the transferability of multimodal adversarial attacks against VLP models under the black-box setting. In this work, we take CLIP as the surrogate model and propose a gradient-based multimodal attack method to generate transferable adversarial samples against VLP models. By using the gradient to optimize the adversarial images and adversarial texts simultaneously, our method can better search for and attack vulnerable image-text pairs. To improve the transferability of the attack, we use contrastive learning, including image-text contrastive learning and intra-modal contrastive learning, to obtain a more generalized understanding of the underlying data distribution and to mitigate overfitting to the surrogate model, so that the generated multimodal adversarial samples have higher transferability to VLP models. Extensive experiments validate the effectiveness of the proposed method.
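
To make the idea of a gradient-based attack on an image-text contrastive objective concrete, here is a minimal sketch in plain Python. It is not the authors' method: it perturbs only the image side, uses a single image-to-text InfoNCE direction, replaces CLIP with a hypothetical toy linear encoder, and estimates gradients by finite differences instead of backpropagation. All function names, the temperature, and the step sizes are assumptions chosen for illustration.

```python
import math

def encode(weights, x):
    """Toy linear encoder; a stand-in (assumption) for a real image encoder such as CLIP's."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb + 1e-12)

def itc_loss(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE loss: image i should match the text at index i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(image_embs)

def attack_loss(weights, images, text_embs):
    return itc_loss([encode(weights, x) for x in images], text_embs)

def numeric_grad(f, x, h=1e-5):
    """Central finite differences; used here in place of automatic differentiation."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + h] + x[k + 1:]
        down = x[:k] + [x[k] - h] + x[k + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def pgd_image_attack(weights, images, text_embs, target=0,
                     eps=0.1, alpha=0.02, steps=10):
    """Sign-gradient ascent on the loss for images[target], kept within an L-inf ball of radius eps."""
    clean = images[target]
    adv = list(clean)

    def loss_of(x):
        batch = images[:target] + [x] + images[target + 1:]
        return attack_loss(weights, batch, text_embs)

    for _ in range(steps):
        g = numeric_grad(loss_of, adv)
        # Step in the direction that increases the loss, then project back into the eps-ball.
        adv = [min(max(a + alpha * sign(gk), c - eps), c + eps)
               for a, gk, c in zip(adv, g, clean)]
    return adv

# Toy data (assumption): two 2-pixel "images" and their matching text embeddings.
weights = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.2], [0.1, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

clean_loss = attack_loss(weights, images, text_embs)
adv_image = pgd_image_attack(weights, images, text_embs, target=0, eps=0.1)
adv_loss = attack_loss(weights, [adv_image] + images[1:], text_embs)
print(clean_loss, adv_loss)
```

In this toy setting the perturbed image stays within the chosen bound while its contrastive loss against the matching text rises. A real attack would differentiate through the surrogate model itself and would also perturb the text, which this sketch does not attempt.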

Subjects:	Multimedia (cs.MM)
Cite as:	arXiv:2308.12636 [cs.MM]
	(or arXiv:2308.12636v2 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2308.12636

Submission history

From: Youze Wang
[v1] Thu, 24 Aug 2023 08:22:21 UTC (803 KB)
[v2] Sun, 5 Nov 2023 02:07:32 UTC (803 KB)
