GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Xu, Xuwei; Wang, Sen; Chen, Yudong; Zheng, Yanping; Wei, Zhewei; Liu, Jiajun

arXiv:2311.03035 (cs)

[Submitted on 6 Nov 2023 (v1), last revised 8 Jan 2024 (this version, v2)]

Title: GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Authors: Xuwei Xu, Sen Wang, Yudong Chen, Yanping Zheng, Zhewei Wei, Jiajun Liu

Abstract: Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployment on resource-constrained devices remains challenging due to high computational demands. To expedite pre-trained ViTs, token pruning and token merging approaches have been developed, which aim to reduce the number of tokens involved in the computation. However, these methods still have limitations, such as image information loss from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP propagates the information of less significant tokens to spatially and semantically connected tokens that are of greater importance. Consequently, the few remaining tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information from the eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify the image tokens to be propagated. Extensive experiments have validated GTP's effectiveness, demonstrating both efficiency and performance improvements. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at this https URL.
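The propagation idea described in the abstract can be illustrated with a small, hedged sketch. This is not the authors' implementation: the function names, the 8-connected grid neighbourhood, the cosine-similarity threshold `sim_threshold`, the mixing weight `alpha`, and the rule of splitting each eliminated token evenly among its connected kept tokens are all assumptions made for illustration. Importance scores are taken as given, since the abstract does not specify how they are computed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def grid_neighbors(i, j, width):
    """True if tokens i and j are distinct 8-connected cells on a grid with `width` columns."""
    ri, ci = divmod(i, width)
    rj, cj = divmod(j, width)
    return i != j and abs(ri - rj) <= 1 and abs(ci - cj) <= 1


def propagate_tokens(tokens, scores, width, keep, sim_threshold=0.9, alpha=0.5):
    """Keep the `keep` highest-scoring tokens and fold the others into them.

    Hypothetical rule: an eliminated token j is connected to a kept token i if
    they are spatial neighbours on the patch grid OR their cosine similarity is
    at least `sim_threshold`. Its features, scaled by `alpha`, are split evenly
    among all kept tokens it is connected to.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])      # indices of the tokens that survive
    dropped = order[keep:]           # indices of the tokens to be propagated
    out = {i: list(tokens[i]) for i in kept}
    for j in dropped:
        targets = [
            i for i in kept
            if grid_neighbors(i, j, width)
            or cosine(tokens[i], tokens[j]) >= sim_threshold
        ]
        if not targets:
            continue                 # isolated token: its information is lost
        w = alpha / len(targets)
        for i in targets:
            out[i] = [x + w * y for x, y in zip(out[i], tokens[j])]
    return kept, [out[i] for i in kept]


# Toy input: four 2-dimensional tokens laid out on a 1 x 4 patch grid.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
scores = [0.9, 0.1, 0.2, 0.8]        # assumed importance scores
kept, summary = propagate_tokens(tokens, scores, width=4, keep=2)
print(kept, summary)
```

In this toy input, tokens 0 and 3 are kept. Token 2 reaches token 3 through spatial adjacency and token 0 through feature similarity, and token 1 reaches token 0 spatially and token 3 semantically, so both eliminated tokens contribute to both survivors instead of being discarded.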

Comments:	Accepted to WACV2024 (Oral)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2311.03035 [cs.CV]
	(or arXiv:2311.03035v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.03035

Submission history

From: Xuwei Xu
[v1] Mon, 6 Nov 2023 11:14:19 UTC (4,447 KB)
[v2] Mon, 8 Jan 2024 03:42:25 UTC (4,447 KB)
