Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts

Mustafa, Basil; Riquelme, Carlos; Puigcerver, Joan; Jenatton, Rodolphe; Houlsby, Neil

Computer Science > Computer Vision and Pattern Recognition

arXiv:2206.02770 (cs)

[Submitted on 6 Jun 2022]

Abstract: Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts.
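The abstract names two ingredients without spelling them out: a contrastive loss over image-text pairs and an entropy-based regularizer on expert routing. The sketch below illustrates the general idea of each under stated assumptions and should not be read as the LIMoE implementation. The function names, the temperature value of 0.1, and the choice to report a per-token entropy and an averaged-distribution entropy are hypothetical choices made for illustration; the abstract does not specify how the regularizer is built from such quantities.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric cross-entropy over pairwise similarities.

    Assumes image_embs[i] and text_embs[i] form a matching pair and that
    every other combination in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text direction: each row should peak on its own index.
    img_to_txt = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    txt_to_img = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def routing_entropies(router_probs):
    """Two entropy summaries of a router's output (illustrative only).

    router_probs holds one probability distribution over experts per token.
    Returns (mean per-token entropy, entropy of the token-averaged distribution).
    """
    n = len(router_probs)
    per_token = sum(entropy(p) for p in router_probs) / n
    mean_probs = [sum(col) / n for col in zip(*router_probs)]
    return per_token, entropy(mean_probs)
```

In this sketch, a low per-token entropy means each token is routed confidently, while a high entropy of the averaged distribution means the tokens as a group spread across experts. A regularizer addressing balanced expert utilization could plausibly combine the two, but that combination is an assumption here rather than something the abstract states.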

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.02770 [cs.CV]
	(or arXiv:2206.02770v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.02770

Submission history

From: Basil Mustafa
[v1] Mon, 6 Jun 2022 17:51:59 UTC (6,078 KB)
