ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Hernandez, Jefferson; Villegas, Ruben; Ordonez, Vicente

Computer Science > Computer Vision and Pattern Recognition

arXiv:2303.12001 (cs)

[Submitted on 21 Mar 2023 (v1), last revised 30 Nov 2023 (this version, v2)]

Abstract: We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) with contrastive learning. ViC-MAE is trained using a global feature obtained by pooling the local representations learned under an MAE reconstruction loss, and this global representation is then used in a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to both video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from a diverse combination of datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming a close second to the best supervised method.
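
The training signal described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the pooling operator, the form of the contrastive loss, or how the two terms are weighted, so the mean pooling, the InfoNCE loss, the temperature, and the `weight` parameter below are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def pool_global(local_tokens):
    """Mean-pool local (per-patch) features of shape (B, N, D) into (B, D).
    Mean pooling is an assumption; the abstract only says 'pooling'."""
    return local_tokens.mean(axis=1)

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over masked patches only (standard MAE-style loss).
    pred, target: (B, N, P); mask: (B, N) with 1 marking a masked patch."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss (assumed form): row i of z_a and row i of z_b are a
    positive pair; every other row in the batch acts as a negative."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def combined_loss(tokens_a, tokens_b, pred, target, mask, weight=1.0):
    """Reconstruction loss plus a weighted contrastive term on pooled features.
    The additive combination and `weight` are illustrative assumptions."""
    contrastive = info_nce(pool_global(tokens_a), pool_global(tokens_b))
    reconstruction = masked_reconstruction_loss(pred, target, mask)
    return reconstruction + weight * contrastive

# Toy usage with random arrays standing in for encoder and decoder outputs.
rng = np.random.default_rng(0)
B, N, D, P = 4, 16, 32, 48              # batch, patches, feature dim, pixels per patch
tokens_a = rng.normal(size=(B, N, D))   # local features of one frame (or image view)
tokens_b = tokens_a + 0.1 * rng.normal(size=(B, N, D))  # a second frame or view
target = rng.normal(size=(B, N, P))     # ground-truth patch pixels
pred = target + 0.5 * rng.normal(size=(B, N, P))        # decoder reconstruction
mask = np.zeros((B, N))
mask[:, :12] = 1.0                      # 75% of patches masked
print(combined_loss(tokens_a, tokens_b, pred, target, mask))
```

In this sketch, the two inputs to the contrastive term stand for two frames of the same video or two views of the same image, which matches the abstract's "contrastive objective across images and video frames" only at a high level.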

Comments:	More results on video and image datasets; ViC-MAE now supports training on videos and images
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.12001 [cs.CV]
	(or arXiv:2303.12001v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.12001

Submission history

From: Jefferson Hernandez Enrique
[v1] Tue, 21 Mar 2023 16:33:40 UTC (704 KB)
[v2] Thu, 30 Nov 2023 15:53:00 UTC (1,810 KB)
