Adaptive Inverse Transform Sampling For Efficient Vision Transformers

Fayyaz, Mohsen; Koohpayegani, Soroush Abbasi; Jafari, Farnoush Rezaei; Sengupta, Sunando; Joze, Hamid Reza Vaezi; Sommerlade, Eric; Pirsiavash, Hamed; Gall, Juergen

arXiv:2111.15667v2 (cs)

[Submitted on 30 Nov 2021 (v1), revised 24 Mar 2022 (this version, v2), latest version 26 Jul 2022 (v3)]

Abstract: While state-of-the-art vision transformer models achieve promising results for image classification, they are computationally expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable, parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture. ATS scores tokens and adaptively samples the significant ones. As a result, the number of tokens is no longer constant and varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pre-trained vision transformers as a plug-and-play module, reducing their GFLOPs without any additional training. Moreover, because ATS is differentiable, a vision transformer equipped with it can also be trained. We evaluate our module on both image and video classification tasks by adding it to multiple SOTA vision transformers. Our proposed module improves the SOTA by reducing the computational cost (GFLOPs) by 2x while preserving the accuracy of SOTA models on the ImageNet, Kinetics-400, and Kinetics-600 datasets.
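The adaptive sampling idea described in the abstract can be illustrated with a minimal sketch of inverse transform sampling over token scores. This is not the authors' implementation: the function name `sample_tokens`, the parameter `max_tokens`, the use of fixed, evenly spaced quantiles, and the assumption that per-token significance scores are already available are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_tokens(scores, max_tokens):
    """Select token indices by inverse transform sampling over scores.

    Hypothetical helper: `scores` are non-negative per-token significance
    scores (how they are computed is outside this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    probs = scores / scores.sum()      # normalize scores to a distribution
    cdf = np.cumsum(probs)             # cumulative distribution over tokens
    # Assumption: deterministic, evenly spaced quantiles in (0, 1)
    u = (np.arange(max_tokens) + 0.5) / max_tokens
    # Inverse CDF: first token whose cumulative mass reaches each quantile
    idx = np.searchsorted(cdf, u, side="left")
    idx = np.minimum(idx, len(scores) - 1)  # guard against float rounding
    # Several quantiles can land on the same token, so the number of
    # distinct tokens kept depends on the input scores
    return np.unique(idx)
```

With `max_tokens=4`, the peaked scores `[0.97, 0.01, 0.01, 0.01]` keep only token 0, while uniform scores `[1, 1, 1, 1]` keep all four tokens. The sketch therefore shows how one sampling budget can yield a different number of tokens per input.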

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2111.15667 [cs.CV]
	(or arXiv:2111.15667v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.15667

Submission history

From: Mohsen Fayyaz
[v1] Tue, 30 Nov 2021 18:56:57 UTC (21,551 KB)
[v2] Thu, 24 Mar 2022 22:49:18 UTC (22,149 KB)
[v3] Tue, 26 Jul 2022 17:54:59 UTC (24,350 KB)
