How Much Temporal Long-Term Context is Needed for Action Segmentation?

Bahrami, Emad; Francesca, Gianpiero; Gall, Juergen

arXiv:2308.11358v1 (cs)

[Submitted on 22 Aug 2023 (this version), latest version 25 Sep 2023 (v2)]

Abstract: Modeling long-term context in videos is crucial for many fine-grained tasks, including temporal action segmentation. An open question is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation therefore combine temporal convolutional networks with self-attention that is computed only within a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we investigate how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
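
The contrast the abstract draws between attention restricted to a local temporal window and sparse attention that reaches across the whole video can be illustrated with a minimal sketch. The strided pattern below is only one common way to sparsify attention, chosen here for illustration; it is an assumption and not necessarily the model proposed in the paper. The names `local_window_pairs`, `strided_pairs`, `temporal_span`, and the values of `T` and `w` are hypothetical.

```python
def local_window_pairs(T, w):
    """Frame pairs (i, j) attended when frame i sees only frames with |i - j| <= w // 2."""
    half = w // 2
    return {(i, j)
            for i in range(T)
            for j in range(max(0, i - half), min(T, i + half + 1))}


def strided_pairs(T, w):
    """Frame pairs (i, j) attended when frame i sees every w-th frame sharing its offset."""
    return {(i, j) for i in range(T) for j in range(i % w, T, w)}


def temporal_span(pairs):
    """Largest temporal distance |i - j| covered by any attended pair."""
    return max(abs(i - j) for i, j in pairs)


# Hypothetical setting: a video of 1000 frames, window and stride of 10 frames.
T, w = 1000, 10
dense_count = T * T                        # pairs scored by full self-attention
local = local_window_pairs(T, w)           # local context only
sparse = local | strided_pairs(T, w)       # local context plus strided long-range links

print("dense pairs: ", dense_count)
print("local pairs: ", len(local), "span:", temporal_span(local))
print("sparse pairs:", len(sparse), "span:", temporal_span(sparse))
```

In this toy setting the local pattern never relates frames more than a few steps apart, whereas the combined sparse pattern links frames across almost the entire sequence while still scoring far fewer pairs than dense attention.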

Comments:	ICCV 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2308.11358 [cs.CV]
	(or arXiv:2308.11358v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.11358

Submission history

From: Emad Bahrami
[v1] Tue, 22 Aug 2023 11:20:40 UTC (5,407 KB)
[v2] Mon, 25 Sep 2023 14:58:59 UTC (5,407 KB)
