Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Huang, Po-Yao; Patrick, Mandela; Hu, Junjie; Neubig, Graham; Metze, Florian; Hauptmann, Alexander

arXiv:2103.08849 (cs)

[Submitted on 16 Mar 2021 (v1), last revised 15 Apr 2021 (this version, v3)]

Abstract: This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at this http URL.

Comments:	Accepted by NAACL 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2103.08849 [cs.CV]
	(or arXiv:2103.08849v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2103.08849

Submission history

From: Po-Yao Huang [view email]
[v1] Tue, 16 Mar 2021 04:37:40 UTC (20,692 KB)
[v2] Thu, 18 Mar 2021 17:40:09 UTC (20,407 KB)
[v3] Thu, 15 Apr 2021 02:01:38 UTC (21,352 KB)
