Streaming Video Model

Zhao, Yucheng; Luo, Chong; Tang, Chuanxin; Chen, Dongdong; Codella, Noel; Zha, Zheng-Jun

Abstract: Video understanding tasks have traditionally been modeled by two separate architectures, each tailored to one of two distinct kinds of task. Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single fixed-image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled, temporally-aware spatial encoder to serve the frame-based video tasks. The frame features are then fed into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by state-of-the-art accuracy on the sequence-based action recognition task and a competitive advantage over conventional architectures on the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at this https URL.
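
The two-stage data flow described in the abstract can be illustrated with a toy Python sketch. This is not the S-ViT implementation: the names `StreamingEncoder`, `temporal_decoder`, and `run_stream` are hypothetical, frames are plain lists of floats, and simple averaging stands in for the transformer layers, the memory mechanism, and the task-related decoder. It only shows that per-frame features become available as frames arrive, and that a clip-level feature is derived from them afterwards.

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


class StreamingEncoder:
    """Toy stand-in for a memory-enabled, temporally-aware spatial encoder.

    Assumption: averaging over a bounded memory replaces the attention
    over past-frame features that a real model would use.
    """

    def __init__(self, memory_size: int = 2) -> None:
        # Bounded memory holding features of the most recent frames.
        self.memory: Deque[Vector] = deque(maxlen=memory_size)

    def encode(self, frame: Vector) -> Vector:
        if self.memory:
            # Blend the current frame with the mean of remembered features.
            n = len(self.memory)
            context = [sum(vals) / n for vals in zip(*self.memory)]
            feature = [0.5 * x + 0.5 * c for x, c in zip(frame, context)]
        else:
            # First frame: nothing to remember yet.
            feature = list(frame)
        self.memory.append(feature)
        return feature


def temporal_decoder(frame_features: List[Vector]) -> Vector:
    """Toy temporal decoder: mean-pool frame features into a clip feature."""
    n = len(frame_features)
    return [sum(vals) / n for vals in zip(*frame_features)]


def run_stream(
    frames: List[Vector], memory_size: int = 2
) -> Tuple[List[Vector], Vector]:
    encoder = StreamingEncoder(memory_size)
    # Frame-level features, usable by frame-based tasks such as tracking.
    frame_features = [encoder.encode(f) for f in frames]
    # Clip-level feature, usable by sequence-based tasks such as
    # action recognition.
    clip_feature = temporal_decoder(frame_features)
    return frame_features, clip_feature


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    per_frame, clip = run_stream(frames)
    print("frame-level features:", per_frame)
    print("clip-level feature:", clip)
```

In this sketch the encoder is the only stateful component, so a frame-based consumer can read each feature as soon as `encode` returns, without waiting for the whole clip.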

Comments:	Accepted by CVPR'23
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.17228 [cs.CV]
	(or arXiv:2303.17228v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.17228
