Enhancing Transformer Backbone for Egocentric Video Action Segmentation

Reza, Sakib; Sundareshan, Balaji; Moghaddam, Mohsen; Camps, Octavia

arXiv:2305.11365 (cs)

[Submitted on 19 May 2023 (v1), last revised 23 May 2023 (this version, v2)]

Abstract: Egocentric temporal action segmentation in videos is a crucial task in computer vision with applications in various fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models. Therefore, it is necessary to improve transformers to enhance the robustness of action segmentation models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the loss of local context by the decoder. We also utilize state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our proposed approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and we validate our introduced components with ablation studies. The source code and supplementary materials are publicly available.
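
As a rough illustration of the dual dilated attention idea named in the abstract, the sketch below runs two windowed self-attention branches over a sequence of frame features, one whose window grows with layer depth (local-to-global) and one whose window shrinks (global-to-local), and then fuses them. It is not the authors' implementation: the power-of-two window schedule, the fixed averaging used in place of an adaptive fusion, and the names `local_attention` and `dual_dilated_layer` are all assumptions made for this example.

```python
import math


def local_attention(x, window):
    """Self-attention where frame t only sees frames within window // 2 steps.

    x is a list of T feature vectors (lists of floats). Queries, keys and
    values are all x itself, which is a simplification for illustration.
    """
    num_frames, dim = len(x), len(x[0])
    half = window // 2
    out = []
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        # Scaled dot-product scores against the frames inside the window.
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(dim)
            for s in range(lo, hi)
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(weights[k] / total * x[lo + k][j] for k in range(len(weights)))
            for j in range(dim)
        ])
    return out


def dual_dilated_layer(x, layer, num_layers):
    """Fuse a growing-window branch and a shrinking-window branch.

    Assumed schedule: window 2**layer for the local-to-global branch and
    2**(num_layers - 1 - layer) for the global-to-local branch.
    """
    grow = 2 ** layer
    shrink = 2 ** (num_layers - 1 - layer)
    a = local_attention(x, grow)
    b = local_attention(x, shrink)
    # Assumption: a plain average stands in for an adaptive fusion.
    return [[(u + v) / 2 for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for layer in range(3):
        print(layer, dual_dilated_layer(frames, layer, num_layers=3))
```

Under this assumed schedule, the first layer pairs the smallest window with the largest and the last layer does the reverse, so every layer sees both fine and coarse temporal context.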

Comments:	Joint 3rd Ego4D and 11th EPIC Workshop on Egocentric Vision at CVPR 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.11365 [cs.CV]
	(or arXiv:2305.11365v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.11365

Submission history

From: Sakib Reza
[v1] Fri, 19 May 2023 01:00:08 UTC (5,953 KB)
[v2] Tue, 23 May 2023 20:38:40 UTC (5,953 KB)
