Contrastive Vision-Language Pre-training with Limited Resources

Cui, Quan; Zhou, Boyan; Guo, Yu; Yin, Weidong; Wu, Hao; Yoshie, Osamu; Chen, Yubo

arXiv:2112.09331 (cs)

[Submitted on 17 Dec 2021 (v1), last revised 18 Jul 2022 (this version, v3)]

Abstract: Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we propose a stack of novel methods that significantly cut down the heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. In addition, we provide a reproducible baseline with competitive results, namely ZeroVL, trained with only 14M samples from publicly accessible academic datasets and 8 V100 GPUs. We also collect 100M web samples for pre-training and achieve results comparable or superior to those of state-of-the-art methods, further demonstrating the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at this https URL.

Comments:	Accepted to ECCV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2112.09331 [cs.CV]
	(or arXiv:2112.09331v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2112.09331

Submission history

From: Quan Cui
[v1] Fri, 17 Dec 2021 05:40:28 UTC (1,014 KB)
[v2] Tue, 18 Jan 2022 11:56:47 UTC (1,026 KB)
[v3] Mon, 18 Jul 2022 06:05:07 UTC (2,689 KB)
