Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Shang, **ghuan; Das, Srijan; Ryoo, Michael S.

arXiv:2206.11895v4 (cs)

[Submitted on 23 Jun 2022 (v1), last revised 13 Jan 2023 (this version, v4)]

Abstract: Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at this https URL.
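
The abstract names two geometric ingredients, a per-token pseudo-depth and a camera transformation, but gives no formulas. The sketch below is only an illustration of that general idea and not the authors' implementation. It assumes a pinhole model with unit focal length, the convention p_world = R * p_cam + t, and hand-picked depth and camera values in place of the learned estimators. All function names are hypothetical.

```python
import math


def patch_centers(grid_h, grid_w):
    """Normalized (u, v) centers of a grid_h x grid_w patch grid, in [-1, 1]."""
    centers = []
    for i in range(grid_h):
        for j in range(grid_w):
            u = (j + 0.5) / grid_w * 2.0 - 1.0
            v = (i + 0.5) / grid_h * 2.0 - 1.0
            centers.append((u, v))
    return centers


def back_project(u, v, depth):
    """Lift a 2D token location to camera space (assumed unit focal length)."""
    return (u * depth, v * depth, depth)


def rotation_y(angle):
    """3x3 rotation about the y axis, standing in for a learned camera rotation."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rigid_transform(point, rotation, translation):
    """Apply p_world = R * p_cam + t (an assumed convention)."""
    return tuple(
        sum(rotation[r][k] * point[k] for k in range(3)) + translation[r]
        for r in range(3)
    )


def recover_token_positions(depths, grid_h, grid_w, rotation, translation):
    """Return one 3D position per token, given one pseudo-depth per token."""
    centers = patch_centers(grid_h, grid_w)
    if len(depths) != len(centers):
        raise ValueError("expected one depth value per token")
    return [
        rigid_transform(back_project(u, v, d), rotation, translation)
        for (u, v), d in zip(centers, depths)
    ]


if __name__ == "__main__":
    # Illustrative values only: a 2x2 token grid with made-up depths.
    positions = recover_token_positions(
        [1.0, 2.0, 1.0, 2.0], 2, 2, rotation_y(0.0), (0.0, 0.0, 0.0)
    )
    for p in positions:
        print(p)
```

In a real model the depths and camera parameters would be predicted from the tokens themselves, and the recovered positions would typically be embedded and combined with the token features. This sketch covers only the geometry.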

Comments:	NeurIPS 2022. Our code is at this https URL. Our project page is at this https URL. v3, v4: minor updates to figures and visualizations.
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2206.11895 [cs.CV]
	(or arXiv:2206.11895v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.11895

Submission history

From: **ghuan Shang
[v1] Thu, 23 Jun 2022 17:59:35 UTC (35,288 KB)
[v2] Wed, 12 Oct 2022 22:00:53 UTC (49,124 KB)
[v3] Fri, 28 Oct 2022 03:28:06 UTC (45,745 KB)
[v4] Fri, 13 Jan 2023 00:53:21 UTC (44,869 KB)
