Multi-Modal Temporal Convolutional Network for Anticipating Actions in Egocentric Videos

Zatsarynna, Olga; Farha, Yazan Abu; Gall, Juergen

arXiv:2107.09504 (cs)

[Submitted on 18 Jul 2021]

Abstract: Anticipating human actions is an important task that must be addressed to develop reliable intelligent agents, such as self-driving cars or robot assistants. While the ability to predict the future with high accuracy is crucial when designing anticipation approaches, the speed at which inference is performed is no less important. Methods that are accurate but not sufficiently fast introduce high latency into the decision process, which increases the reaction time of the underlying system. This poses a problem for domains such as autonomous driving, where reaction time is crucial. In this work, we propose a simple and effective multi-modal architecture based on temporal convolutions. Our approach stacks a hierarchy of temporal convolutional layers and, to ensure fast prediction, does not rely on recurrent layers. We further introduce a multi-modal fusion mechanism that captures the pairwise interactions between RGB, flow, and object modalities. Results on two large-scale datasets of egocentric videos, EPIC-Kitchens-55 and EPIC-Kitchens-100, show that our approach achieves performance comparable to state-of-the-art approaches while being significantly faster.
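
The two ideas named in the abstract, a stack of temporal convolutions without recurrence and a fusion step over pairs of modalities, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract does not specify kernel sizes, layer counts, activations, or the form of the fusion, so the averaging kernels, the ReLU, the elementwise-product fusion, and every name below (`temporal_conv`, `stack_convs`, `pairwise_fusion`, the toy `streams`) are assumptions chosen only for illustration.

```python
from itertools import combinations


def temporal_conv(seq, kernel):
    """Unpadded 1D convolution over time; seq is a list of feature vectors."""
    k = len(kernel)
    dim = len(seq[0])
    out = []
    for t in range(len(seq) - k + 1):
        # Weighted sum of k consecutive frames, one weight per time offset.
        out.append([sum(kernel[i] * seq[t + i][d] for i in range(k))
                    for d in range(dim)])
    return out


def stack_convs(seq, kernels):
    """Apply temporal convolutions one after another (no recurrence), with ReLU."""
    for kernel in kernels:
        seq = [[max(0.0, v) for v in frame]
               for frame in temporal_conv(seq, kernel)]
    return seq


def pairwise_fusion(features):
    """Hypothetical fusion: sum of elementwise products over all modality pairs."""
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for a, b in combinations(sorted(features), 2):
        for d in range(dim):
            fused[d] += features[a][d] * features[b][d]
    return fused


# Toy input (made up): 4 time steps of 2-dimensional features per modality.
streams = {
    "rgb": [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]],
    "flow": [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    "obj": [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
}
kernels = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # three layers, kernel size 2

# Each layer shortens the sequence by one step, so 4 steps collapse to 1.
summaries = {name: stack_convs(seq, kernels)[0] for name, seq in streams.items()}
fused = pairwise_fusion(summaries)
print(fused)
```

Because every layer is a fixed-size convolution, all time steps within a layer can be computed independently of one another, which is the property that makes convolutional stacks faster at inference than recurrent layers that must process frames one at a time.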

Comments:	CVPR Precognition Workshop
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2107.09504 [cs.CV]
	(or arXiv:2107.09504v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2107.09504

Submission history

From: Olga Zatsarynna
[v1] Sun, 18 Jul 2021 16:21:35 UTC (854 KB)
