Space-time Reinforcement Network for Video Object Segmentation

1st Yadang Chen
School of Computer Science
Nanjing University of Information Science and Technology
Nanjing, China

2nd Wentao Zhu
School of Computer Science
Nanjing University of Information Science and Technology
Nanjing, China

3rd Zhi-Xin Yang
State Key Laboratory of Internet of Things for Smart City
University of Macau
Macau, China

4th Enhua Wu
Key Laboratory of System Software and State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences
Beijing, China

Abstract

Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods achieve superior performance, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose to generate an auxiliary frame between adjacent frames, which serves as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms state-of-the-art methods on DAVIS 2017, achieving a $\mathcal{J}\&\mathcal{F}$ score of 86.4%, and attains a competitive 85.0% on YouTube-VOS 2018. In addition, our network runs at a high inference speed of 32+ FPS.

Index Terms:
Video object segmentation, memory-based methods, auxiliary frame, prototype learning

I Introduction

Semi-supervised video object segmentation (VOS) is a challenging task in computer vision that has drawn widespread attention in autonomous driving, robotics and video editing. Given the first-frame annotation provided by the user, the goal is to segment the objects in the remaining frames as accurately as possible. The key to this task is to fully exploit the limited signals available between frames.

Recently, matching-based methods [1, 2, 3] have achieved great success in semi-supervised VOS, segmenting objects through pixel-level matching between query frames and memory frames. Among these methods, memory-based networks [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] have attracted a lot of attention. For instance, the space-time memory network (STM) [4] was the first to construct a feature memory for each object and apply space-time matching between query and memory frames, which better handles object occlusion and drift. After STM [4], many variants [8, 9, 10, 11, 14, 16, 17, 18] were developed to improve accuracy, reduce memory usage, and so on. In particular, the latest related state-of-the-art methods, STCN [13] and XMem [15], share multiple memory stores across all objects to capture different spatial-temporal contexts and have achieved prominent performance.

Figure 1: (a) t-SNE visualization of the difference between frames. Left: the feature maps of the query and adjacent frames. Right: the feature maps of the query and auxiliary frames. Our proposed auxiliary frame is more consistent with the query than the adjacent frame. (b) Comparison of pixel-level matching (top) and prototype-level matching (bottom). Orange arrows indicate wrong matches. We propose prototype-level matching to reduce undesired mismatching.

Although these methods have achieved good results, two issues still need to be carefully considered. First, existing methods [6, 11, 15] exploit the space-time coherence between adjacent video frames to exclude distractor objects; however, this coherence is often destroyed by occlusion, fast motion and irregular deformation. This is illustrated in Fig. 1(a), where the features of the query and its adjacent frame are inconsistent. Second, dense pixel-level matching between the memory features and the query features leads to undesired mismatches caused by noise or distractors, as shown in Fig. 1(b). Thus, a higher-level matching is needed as an auxiliary mechanism to address the problem.

Motivated by the above discussion, we propose a Space-time Reinforcement Network (SRNet) for video object segmentation to address these two weaknesses. In more detail, the core ideas of this paper are twofold: i) Instead of directly exploring the space-time coherence between adjacent video frames, we propose to generate an auxiliary frame from adjacent frames, serving as an implicit short-temporal reference for the query frame. The proposed auxiliary frame is more consistent with the query frame, as shown in Fig. 1(a). ii) Beyond pixel-level matching, we propose to additionally learn a prototype for each video object. Thus, a high-level semantic matching, i.e., prototype-level matching, can be performed between the query and memory to effectively improve the robustness of the method, as also shown in Fig. 1(b).

Our contributions can be summarized as follows:

  • We propose to generate an implicit auxiliary frame between adjacent frames to improve the space-time coherence.

  • We introduce a high-level semantic matching, i.e., prototype-level matching, to reduce undesired mismatching caused by noise or distractors.

  • Our SRNet achieves state-of-the-art performance on DAVIS 2017 and competitive results on YouTube-VOS 2018, while maintaining a high inference speed of 32+ FPS.

Figure 2: An overview of SRNet. We propose a Feature Alignment Module (FAM) for generating an auxiliary frame to obtain the local feature and a Prototype Transformer Module (PTM) to implement prototype-level matching.

II Method

Fig. 2 shows an overview of the proposed SRNet. We first review recent memory-based VOS methods and then elaborate on the improvements that SRNet makes on top of them.

Revisit memory-based VOS methods. Since the seminal work of STM [4], memory-based methods have emerged as the predominant solution for VOS. Specifically, they use the current frame $F_t \in \mathbb{R}^{H^0 \times W^0 \times 3}$ as the query and store the reference frames $\{F_0, \cdots, F_{t-1}\}$ and masks $\{M_0, \cdots, M_{t-1}\}$ in memory, where $H^0$ and $W^0$ denote the original height and width of the query frame. The mask $M_t$ is then predicted by space-time matching between the query and the memory. We revisit the framework of memory-based methods such as STCN [13] and XMem [15], as they are among the simplest and most effective.

As shown in Fig. 2, given $T$ memory frames and a query frame, the query frame is passed to the query encoder to generate the query key $k^Q \in \mathbb{R}^{H \times W \times C^k}$ and query value $v^Q \in \mathbb{R}^{H \times W \times C^v}$. Meanwhile, the memory frames and masks are fed to the memory encoder to obtain the memory key $k^M \in \mathbb{R}^{T \times H \times W \times C^k}$ and memory value $v^M \in \mathbb{R}^{T \times H \times W \times C^v}$, where $H$ and $W$ are the spatial dimensions at a stride of 16, $C^k$ is the dimension of the key space and $C^v$ is the dimension of the value space. Then, the memory read performs space-time correspondence matching between $k^Q$ and $k^M$. We compute the affinity matrix $W \in [0,1]^{THW \times HW}$:

$$W(i,j) = \frac{\exp\left(\varepsilon\left(k^M(i), k^Q(j)\right)\right)}{\sum_{i} \exp\left(\varepsilon\left(k^M(i), k^Q(j)\right)\right)}, \tag{1}$$

where $W(i,j)$ indicates the similarity between the $i$-th memory pixel and the $j$-th query pixel, and $\varepsilon(\cdot,\cdot)$ is a similarity measure, e.g., the $L_2$ distance.

The affinity $W$ is then multiplied with $v^M$ to obtain the memory feature $F_{mem} \in \mathbb{R}^{H \times W \times C^v}$ for the query frame. Finally, $F_{mem}$ is upsampled in the decoder to generate the predicted mask $M_t \in \mathbb{R}^{H^0 \times W^0 \times 1}$.
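The memory read described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `memory_readout` is hypothetical, and the similarity $\varepsilon$ is assumed here to be the negative squared $L_2$ distance so that closer keys receive larger weights.

```python
import numpy as np

def memory_readout(k_mem, v_mem, k_qry):
    """k_mem: (T, H, W, Ck), v_mem: (T, H, W, Cv), k_qry: (H, W, Ck).

    Returns the affinity (THW, HW) and the memory feature F_mem (H, W, Cv).
    """
    T, H, W, Ck = k_mem.shape
    Cv = v_mem.shape[-1]
    km = k_mem.reshape(T * H * W, Ck)   # THW memory pixels
    vm = v_mem.reshape(T * H * W, Cv)
    kq = k_qry.reshape(H * W, Ck)       # HW query pixels

    # Assumed similarity: negative squared L2 distance, shape (THW, HW).
    sim = -((km[:, None, :] - kq[None, :, :]) ** 2).sum(axis=-1)

    # Eq. (1): softmax over the memory index i for every query pixel j.
    sim -= sim.max(axis=0, keepdims=True)   # for numerical stability
    aff = np.exp(sim)
    aff /= aff.sum(axis=0, keepdims=True)

    # Readout: each query pixel receives a weighted sum of memory values.
    f_mem = (aff.T @ vm).reshape(H, W, Cv)
    return aff, f_mem

rng = np.random.default_rng(0)
T, H, W, Ck, Cv = 2, 3, 4, 8, 5
aff, f_mem = memory_readout(rng.normal(size=(T, H, W, Ck)),
                            rng.normal(size=(T, H, W, Cv)),
                            rng.normal(size=(H, W, Ck)))
print(aff.shape, f_mem.shape)   # (24, 12) (3, 4, 5)
```

Each column of the affinity sums to one, so every query pixel reads a convex combination of the memory values.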

Figure 3: Implementation of Feature Alignment Module.

II-A Overview

As shown in Fig. 2, SRNet divides the memory into two blocks: the local memory stores the adjacent frame $F_{t-1}$ and its mask $M_{t-1}$, while the global memory holds the frames $\{F_0, \cdots, F_{t-2}\}$ and their masks $\{M_0, \cdots, M_{t-2}\}$. We feed them into the memory encoder to obtain the local memory key $k^L \in \mathbb{R}^{H \times W \times C^k}$, local memory value $v^L \in \mathbb{R}^{H \times W \times C^v}$, global memory key $k^G \in \mathbb{R}^{(T-1) \times H \times W \times C^k}$ and global memory value $v^G \in \mathbb{R}^{(T-1) \times H \times W \times C^v}$. We further learn a memory prototype $G \in \mathbb{R}^{1 \times 1 \times C^v}$ from $v^G$. Next, we pass $k^L$, $v^L$, $k^Q$ and $v^Q$ into a Feature Alignment Module (FAM) to obtain the value of the auxiliary frame $\overline{v}^L$ and compute the local feature $F_{loc} \in \mathbb{R}^{H \times W \times C^v}$. The memory feature $F_{mem}$ is then obtained by space-time matching between the query and the global memory. The pixel-level feature $F_{pix} \in \mathbb{R}^{H \times W \times C^v}$, obtained by fusing $F_{loc}$ and $F_{mem}$, is passed to a Prototype Transformer Module (PTM) for prototype-level matching with $G$. The two are updated iteratively to obtain the final prototype feature $F_{pro} \in \mathbb{R}^{H \times W \times C^v}$. Finally, the decoder upsamples $F_{pro}$ to generate the predicted mask $M_t$.

II-B Feature Alignment Module

We introduce FAM in this section. As shown in Fig. 3, it aligns the adjacent frame with the query frame to obtain an auxiliary frame, which serves as an implicit short-temporal reference for the query frame. We do not pursue image-level alignment, but rather learn feature-level alignment between the adjacent frame and the query frame. Specifically, what we learn is not the auxiliary frame itself, but its value $\overline{v}^L$. Subsequently, we refine $\overline{v}^L$ to obtain the local feature $F_{loc}$ of the query frame.

Given $k^Q$, $v^Q$, $k^L$ and $v^L$, a uniform grid of points $p \in \mathbb{R}^{H_G \times W_G \times 2}$ is generated as a reference. Specifically, the grid is downsampled from the input feature map size by a factor $g$, i.e., $H_G = H/g$ and $W_G = W/g$. The reference points take the linearly spaced 2D coordinates $\{(0,0), \cdots, (H_G - 1, W_G - 1)\}$, which are normalized by the grid shape $H_G \times W_G$ to the range $[-1, +1]$, where $(-1,-1)$ is the upper-left corner and $(+1,+1)$ is the lower-right corner. To obtain the offset of each reference point, $k^Q$ and $k^L$ are fed into a lightweight separable convolution $\phi(\cdot)$ to generate the offset $\Delta p$. Then, the local memory value $v^L$ and the offset $\Delta p$ are passed to the sampling function $\tau(\cdot,\cdot)$ to generate the value of the auxiliary frame $\overline{v}^L$:

$$\Delta p = \phi\left(k^Q \oplus k^L\right), \quad \overline{v}^L = \tau\left(v^L, p + \Delta p\right). \tag{2}$$
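The construction of the reference grid $p$ can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `reference_grid` is hypothetical, the last axis is assumed to hold $(x, y)$ in that order, and integer division is used for $H/g$ and $W/g$.

```python
import numpy as np

def reference_grid(H, W, g):
    """Return p of shape (H_G, W_G, 2) holding (x, y) coordinates in [-1, 1]."""
    Hg, Wg = H // g, W // g
    ys, xs = np.meshgrid(np.arange(Hg), np.arange(Wg), indexing="ij")
    # Map index 0 to -1 and index (size - 1) to +1; guard against size 1.
    x = 2.0 * xs / max(Wg - 1, 1) - 1.0
    y = 2.0 * ys / max(Hg - 1, 1) - 1.0
    return np.stack([x, y], axis=-1)

p = reference_grid(H=8, W=12, g=2)
print(p.shape)    # (4, 6, 2)
print(p[0, 0])    # upper-left corner: [-1. -1.]
print(p[-1, -1])  # lower-right corner: [1. 1.]
```

The predicted offset $\Delta p$ would simply be added to this array before sampling.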

Specifically, we set $\tau(\cdot,\cdot)$ to bilinear interpolation to make it differentiable:

$$\tau\left(v^L; (p_x, p_y)\right) = \sum_{(v_x, v_y)} g(p_x, v_x)\, g(p_y, v_y)\, v^L[v_y, v_x, :], \tag{3}$$

where $g(a, b) = \max(0, 1 - |a - b|)$ and $(v_x, v_y)$ indexes all positions on $v^L \in \mathbb{R}^{H \times W \times C^v}$.
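Eq. (3) can be checked numerically with a small NumPy sketch. The helper names below are hypothetical, and the conversion from normalized coordinates in $[-1, +1]$ to pixel indices (mapping $-1$ to index 0 and $+1$ to the last index) is an assumption, since the text does not specify the convention.

```python
import numpy as np

def g(a, b):
    """Bilinear kernel g(a, b) = max(0, 1 - |a - b|)."""
    return np.maximum(0.0, 1.0 - np.abs(a - b))

def bilinear_sample(v, px, py):
    """Eq. (3): sample v of shape (H, W, C) at pixel coordinates (px, py)."""
    H, W, _ = v.shape
    wx = g(px, np.arange(W))   # weight of every column index v_x
    wy = g(py, np.arange(H))   # weight of every row index v_y
    return np.einsum("h,w,hwc->c", wy, wx, v)

def to_pixels(p_norm, size):
    """Assumed mapping from [-1, 1] to the index range [0, size - 1]."""
    return (p_norm + 1.0) * 0.5 * (size - 1)

def sample_grid(v, p_norm):
    """Sample v at every normalized (x, y) location stored in p_norm."""
    H, W, C = v.shape
    out = np.empty(p_norm.shape[:2] + (C,))
    for i in range(p_norm.shape[0]):
        for j in range(p_norm.shape[1]):
            px = to_pixels(p_norm[i, j, 0], W)
            py = to_pixels(p_norm[i, j, 1], H)
            out[i, j] = bilinear_sample(v, px, py)
    return out

v = np.arange(4.0).reshape(2, 2, 1)    # values 0, 1 / 2, 3
print(bilinear_sample(v, 0.5, 0.5))    # centre of the four pixels: [1.5]

# With a zero offset and a full-resolution grid, sampling returns v itself.
rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 2))
xs, ys = np.linspace(-1, 1, 4), np.linspace(-1, 1, 3)
identity_grid = np.stack(np.meshgrid(xs, ys), axis=-1)   # (3, 4, 2) as (x, y)
warped = sample_grid(feat, identity_grid)
```

Because the kernel $g$ is piecewise linear in the sampling position, the output is differentiable with respect to the offset almost everywhere, which is what allows $\phi(\cdot)$ to be trained end to end.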

Given $\overline{v}^L$ and $v^Q$, we estimate the convolutional kernel sampling offset $O$ and weight coefficient $W$:

$$\begin{aligned} O &= \mathrm{conv}\left(\overline{v}^L \oplus v^Q\right), \\ W &= \mathrm{conv}\left(\overline{v}^L \oplus v^Q\right), \end{aligned} \tag{4}$$

where $\mathrm{conv}(\cdot)$ is a convolutional layer with a kernel size of $3 \times 3$, and the sampling offset $O$ and weight coefficient $W$ represent, respectively, the positional shift and intensity fluctuation of each pixel in $\overline{v}^L$ relative to $v^Q$.

Figure 4: Implementation of Prototype Transformer Module.

Subsequently, given the value of the auxiliary frame $\overline{v}^L$, the sampling offset $O$ and the weight coefficient $W$ as inputs, the local feature $F_{loc}$ can be computed:

$$F_{loc} = \mathrm{deconv}\left(\overline{v}^L, O, W\right), \tag{5}$$

where $\mathrm{deconv}(\cdot,\cdot,\cdot)$ is a modulated deformable convolution layer [19].

TABLE I: Quantitative comparison on the YouTube-VOS 2018/2019 validation sets, the DAVIS 2017/2016 validation sets, and the DAVIS 2017 test set. The subscripts $s$ and $u$ denote seen and unseen categories. The best result is shown in red and the second-best in blue (underlined here).

| | YT-VOS 2018 Val | | | | | YouTube-VOS 2019 Val | | | | | DAVIS 2017 Val | | | DAVIS 2016 Val | | | DAVIS 2017 Test | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model | $\mathcal{G}$ | $\mathcal{J}_s$ | $\mathcal{F}_s$ | $\mathcal{J}_u$ | $\mathcal{F}_u$ | $\mathcal{G}$ | $\mathcal{J}_s$ | $\mathcal{F}_s$ | $\mathcal{J}_u$ | $\mathcal{F}_u$ | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$ | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$ | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$ | FPS |
| STM [4] | 79.4 | 79.7 | 84.2 | 72.8 | 80.9 | - | - | - | - | - | 81.8 | 79.2 | 84.3 | 89.3 | 88.7 | 89.9 | 72.2 | 69.3 | 75.2 | 11.1 |
| CFBI [5] | 81.4 | 81.1 | 85.8 | 75.3 | 83.4 | - | - | - | - | - | 81.9 | 79.1 | 84.6 | 89.4 | 88.3 | 90.5 | 78.0 | 74.4 | 81.6 | 5.9 |
| RMNet [6] | 81.5 | 82.1 | 85.7 | 75.7 | 82.4 | - | - | - | - | - | 83.5 | 81.0 | 86.0 | 88.8 | 88.9 | 88.7 | 75.0 | 71.9 | 78.1 | 4.4 |
| HMMN [8] | 82.6 | 82.1 | 87.0 | 76.8 | 84.6 | 82.5 | 81.7 | 86.1 | 77.3 | 85.0 | 84.7 | 81.9 | 87.5 | 90.8 | 89.6 | 92.0 | 78.6 | 74.7 | 82.5 | 9.3 |
| STCN [13] | 83.0 | 81.9 | 86.5 | 77.9 | 85.7 | 82.7 | 81.1 | 85.4 | 78.2 | 85.9 | 85.4 | 82.2 | 88.6 | \underline{91.6} | 90.8 | \underline{92.5} | 76.1 | 73.1 | 80.0 | 20.2 |
| AOT [9] | 84.1 | 83.7 | 88.5 | 78.1 | 86.1 | 84.1 | 83.5 | 88.1 | 78.4 | 86.3 | 84.9 | 82.3 | 87.5 | 91.1 | 90.1 | 92.1 | 78.8 | 75.3 | 82.3 | 12.1 |
| RDE [10] | - | - | - | - | - | 81.9 | 81.1 | 85.5 | 76.2 | 84.8 | 84.2 | 80.8 | 87.5 | 91.1 | 89.7 | 92.5 | 77.4 | 73.6 | 81.2 | 27.0 |
| XMem [15] | 85.7 | 84.6 | 89.3 | 80.2 | 88.7 | 85.5 | 84.3 | 88.6 | 80.3 | 88.6 | \underline{86.2} | \underline{82.9} | \underline{89.5} | 91.5 | 90.4 | 92.7 | 81.0 | 77.4 | 84.5 | 22.6 |
| SRNet | \underline{85.0} | \underline{83.9} | \underline{88.5} | \underline{79.7} | \underline{87.9} | \underline{84.7} | \underline{82.9} | \underline{87.4} | \underline{80.3} | \underline{88.3} | 86.4 | 83.1 | 89.7 | 91.8 | \underline{90.6} | 93.0 | \underline{79.2} | \underline{76.5} | \underline{81.9} | 32.3 |

Figure 5: Qualitative comparison of SRNet with STCN [13] and XMem [15] on the YouTube-VOS 2018 validation set and the DAVIS 2017 validation set.

TABLE II: Module ablation study.
FAM PTM $\mathcal{J}\&\mathcal{F}$ $\mathcal{G}$ FPS
86.4 85.0 32.3
86.0 84.6 37.5
85.8 84.3 38.7

II-C Prototype Transformer Module

Pixel-level matching leads to undesired mismatches caused by noise or distractors. To address this issue, we propose a Prototype Transformer Module (PTM), shown in Fig. 4, which performs prototype-level matching between the query and memory.

We can use the mask annotations of the memory frames to learn the foreground prototype. There are two strategies for utilizing segmentation masks, namely early fusion and late fusion. Early fusion masks the support images before feeding them into the feature extractor, whereas late fusion masks the feature maps directly to generate foreground features separately. In this work, we adopt the early fusion strategy because it allows us to reuse the $v^G$ obtained through the memory encoder.

Specifically, given $v^G$, the initial prototype $G_0$ and the initial pixel feature $F_0$ are computed as:

$$G_0 = \frac{\sum_{i=1}^{THW} F\left(v_i^G\right) W\left(v_i^G\right)}{\sum_{i=1}^{THW} W\left(v_i^G\right)}, \tag{6}$$

$$F_0 = \mathrm{fusion}\left(F_{mem}, F_{loc}\right), \tag{7}$$

where $F(\cdot)$ is a 2-layer, $C^v$-dimensional MLP, $W(\cdot)$ is a 2-layer, 1-dimensional MLP, and $\mathrm{fusion}(\cdot,\cdot)$ consists of two ResBlocks and a CBAM block.
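Eq. (6) is a weighted average of per-pixel features, which the following NumPy sketch makes explicit. It is not the authors' implementation: `initial_prototype` is a hypothetical helper, and the two MLPs $F(\cdot)$ and $W(\cdot)$ are passed in as arbitrary callables, replaced in the demo by trivial stand-ins (identity features and uniform weights).

```python
import numpy as np

def initial_prototype(v_g, f_fn, w_fn):
    """Eq. (6): weighted average of F(v_i) with scalar weights W(v_i).

    v_g: global memory value of shape (T, H, W, Cv).
    Returns G_0 with shape (1, 1, Cv).
    """
    Cv = v_g.shape[-1]
    flat = v_g.reshape(-1, Cv)            # one row per memory pixel
    feats = f_fn(flat)                    # (N, Cv), stands in for F(.)
    weights = w_fn(flat).reshape(-1, 1)   # (N, 1), stands in for W(.)
    g0 = (feats * weights).sum(axis=0) / weights.sum()
    return g0.reshape(1, 1, Cv)

rng = np.random.default_rng(0)
v_g = rng.normal(size=(2, 3, 4, 5))

# Hypothetical stand-ins: identity features and uniform weights, in which
# case the prototype reduces to the plain mean over all memory pixels.
g0 = initial_prototype(v_g, lambda x: x, lambda x: np.ones(len(x)))
print(g0.shape)   # (1, 1, 5)
```

With learned weights, pixels that the weight MLP scores highly contribute more to the prototype than the rest.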

Our PTM has learnable prototype G𝐺Gitalic_G and pixel feature F𝐹Fitalic_F, which iteratively updated by initial prototype G0subscript𝐺0G_{0}italic_G start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and initial pixel feature F0subscript𝐹0F_{0}italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT through cross attention. We added position embeddings to the key of each attention layer.

q=G0Wq,k=F0Wk,v=F0Wv,formulae-sequence𝑞subscript𝐺0subscript𝑊𝑞formulae-sequence𝑘subscript𝐹0subscript𝑊𝑘𝑣subscript𝐹0subscript𝑊𝑣\displaystyle q=G_{0}W_{q},~{}k~{}=F_{0}W_{k},v~{}=F_{0}W_{v},italic_q = italic_G start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT , italic_k = italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_v = italic_F start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , (8)
G1=MLP(softmax(q(k+Pk)T)vG0),subscript𝐺1𝑀𝐿𝑃direct-product𝑠𝑜𝑓𝑡𝑚𝑎𝑥𝑞superscript𝑘subscript𝑃𝑘𝑇𝑣subscript𝐺0\displaystyle G_{1}=MLP\left(softmax\left(q\left(k+P_{k}\right)^{T}\right)v% \odot G_{0}\right),italic_G start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_M italic_L italic_P ( italic_s italic_o italic_f italic_t italic_m italic_a italic_x ( italic_q ( italic_k + italic_P start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) italic_v ⊙ italic_G start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) , (9)

where $W_{q}$, $W_{k}$, $W_{v}$ and $P_{k}$ are the learnable parameters in SPA.

Then, guided by the updated prototype $G$, the semantic information of the corresponding pixels in pixel feature $F$ is activated. Specifically, $G$ is extended and combined with $F$ to activate the target area:

$$F_{1} = \varphi\left(expand\left(G_{0}\right) \odot F_{0}\right), \tag{10}$$

where $\varphi(\cdot)$ is a simple activation network consisting of two 3 × 3 convolutional layers and one channel attention layer.

Finally, we use $G_{1}$ as the new $G$ and $F_{1}$ as the new $F$, and update them three times to obtain the final prototype feature $F_{pro}$.
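The update in Eqs. (8)–(10) and its three-fold iteration can be illustrated with a small, dependency-free sketch. It is not the authors' implementation and rests on several assumptions: the prototype is a single $C$-dimensional vector, the pixel feature is flattened to $N \times C$, $P_{k}$ has one row per pixel, the same parameters are reused in every layer, and the MLP and $\varphi$ are passed in as callables (identity stand-ins in the toy run). All function names are hypothetical.

```python
import math


def project(X, W):
    """Right-multiply each row of X (N x C) by W (C x D)."""
    return [[sum(x[c] * W[c][d] for c in range(len(W)))
             for d in range(len(W[0]))] for x in X]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_prototype(G, F, Wq, Wk, Wv, Pk, mlp):
    """Eqs. (8)-(9): G_next = MLP(softmax(q (k + Pk)^T) v * G)."""
    q = project([G], Wq)[0]   # query from the prototype, length C
    k = project(F, Wk)        # keys from pixel features, N x C
    v = project(F, Wv)        # values from pixel features, N x C
    scores = [sum(qc * (kc + pc) for qc, kc, pc in zip(q, k_n, p_n))
              for k_n, p_n in zip(k, Pk)]
    attn = softmax(scores)    # one attention weight per pixel
    pooled = [sum(a * v_n[c] for a, v_n in zip(attn, v))
              for c in range(len(G))]
    return mlp([p * g for p, g in zip(pooled, G)])


def activate_pixels(G, F, phi):
    """Eq. (10): F_next = phi(expand(G) * F), broadcasting G to every pixel."""
    return phi([[g * f for g, f in zip(G, f_n)] for f_n in F])


def run_ptm(G0, F0, Wq, Wk, Wv, Pk, mlp, phi, num_layers=3):
    """Apply the two updates num_layers times (three in the text)."""
    G, F = G0, F0
    for _ in range(num_layers):
        # Both updates read the previous (G, F), as Eqs. (9) and (10) are written.
        G, F = (update_prototype(G, F, Wq, Wk, Wv, Pk, mlp),
                activate_pixels(G, F, phi))
    return G, F


# Toy run: 3 pixels, 2 channels, identity projections and identity stand-ins
# for the MLP and the activation network (assumptions for illustration only).
eye = [[1.0, 0.0], [0.0, 1.0]]
G0 = [1.0, 0.5]
F0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Pk = [[0.0, 0.0] for _ in F0]
G_final, F_final = run_ptm(G0, F0, eye, eye, eye, Pk,
                           mlp=lambda x: x, phi=lambda x: x)
print(G_final, F_final)
```

In this sketch both updates in a layer read the values from the previous layer, which follows Eqs. (9) and (10) as written; feeding the freshly updated prototype into the pixel activation would be an equally simple variant.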


Figure 6: Ablation study of different sampling intervals on local memory in FAM.

III Experiment

For assessment purposes, we utilize standard metrics: the Jaccard index $\mathcal{J}$, contour accuracy $\mathcal{F}$, and their combined average $\mathcal{J}\&\mathcal{F}$. In YouTubeVOS [20], the computation of $\mathcal{J}$ and $\mathcal{F}$ is performed separately for “seen” and “unseen” categories. $\mathcal{G}$ represents the averaged $\mathcal{J}\&\mathcal{F}$ across both “seen” and “unseen” categories.
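As a rough illustration of how these scores combine, the sketch below computes the Jaccard index for a pair of binary masks and then averages precomputed values into $\mathcal{J}\&\mathcal{F}$ and $\mathcal{G}$. The function names are hypothetical, the contour accuracy $\mathcal{F}$ is taken as a given number rather than computed (it requires boundary matching), and returning 1.0 when both masks are empty is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pairs = [(p, g) for p_row, g_row in zip(pred, gt)
             for p, g in zip(p_row, g_row)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def j_and_f(j, f):
    """Combined score: the plain average of J and F."""
    return (j + f) / 2.0


def overall_g(j_seen, f_seen, j_unseen, f_unseen):
    """G: J&F averaged over the seen and unseen categories."""
    return (j_and_f(j_seen, f_seen) + j_and_f(j_unseen, f_unseen)) / 2.0


# Toy masks: the prediction covers two pixels, the ground truth covers one.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(jaccard(pred, gt))
```

Averaging the two per-category $\mathcal{J}\&\mathcal{F}$ values is the same as taking the mean of the four individual scores.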

III-A Quantitative Comparison

Table I tabulates our results on the YouTube 2018/2019 [20] validation, DAVIS 2016/2017 [21] validation, and DAVIS 2017 test-dev sets. On the DAVIS 2017 and 2016 validation sets, our SRNet outperforms the baseline STCN [13] by 1% and 0.2% in $\mathcal{J}\&\mathcal{F}$ while running about 60% faster, and outperforms the state-of-the-art method XMem [15] by 0.2% and 0.3% in $\mathcal{J}\&\mathcal{F}$ while running about 43% faster. On the YouTube 2018 validation and DAVIS 2017 test-dev sets, SRNet scores 2% and 2.1% higher in $\mathcal{J}\&\mathcal{F}$ than STCN [13], but 0.7% and 1.8% lower than XMem [15]. Because we set $C^{v}$ to 256 rather than 512, our inference speed far exceeds that of other methods. We choose STCN [13] as the baseline because its GPU usage is 20% lower than that of XMem [15] and its training is nearly twice as fast.


Figure 7: Ablation study of different numbers of PTM layers.


Figure 8: Visualization of auxiliary masks at different layers of PTM.

III-B Qualitative Comparison

We choose two videos as examples, one from DAVIS 2017 validation and another from YouTube 2018, and compare the segmentation results of SRNet, STCN [13] and XMem [15] in Fig. 5. As can be seen, our SRNet produces more accurate masks: similar objects are distinguished, and undesired mismatching is reduced.

III-C Ablations

We conduct ablation experiments on DAVIS 2017 validation and YouTube 2018 validation. In particular, we report $\mathcal{J}\&\mathcal{F}$ and FPS for DAVIS 2017, alongside $\mathcal{G}$ for YouTube 2018.

An ablation study on the two modules of SRNet is conducted to evaluate their individual effectiveness, as shown in Table II. When both modules are activated (i.e., in the default configuration), SRNet scores 86.4% on DAVIS 2017 and 85.0% on YouTube 2018, with an inference speed of 32.3 FPS. When FAM is disabled, the results decrease by 0.4% and 0.4%, respectively. When PTM is disabled, the results decrease by 0.6% and 0.7%, respectively.

To verify the impact of the sampling interval of local memory in FAM, we experiment with different intervals, as shown in Fig. 6.

Fig. 7 analyzes how the number of PTM layers affects the effectiveness of our network. To better balance accuracy and FPS, we choose to iterate three times. Additionally, we visualize each layer’s auxiliary mask in Fig. 8. It can be observed that the object becomes more coherent (red arrows), and errors are suppressed (yellow arrows). In Fig. 9, it can be seen that, in the pixel feature, pixels of the same class lie far apart while pixels of different classes lie close together. In contrast, the prototype feature after PTM effectively reduces intra-class variation and enlarges inter-class difference.


Figure 9: Visualization results to compare prototype feature (bottom) with pixel feature (top) on three videos (three columns). Varying colors correspond to foreground pixels from different classes. Each point represents a feature vector with shape of $1 \times 1 \times C^{v}$ at a position in the feature, where $C^{v}$ indicates the dimension of the value space.


Figure 10: Failure cases: fine-grained objects are not well segmented. The first column is the first frame with ground truth. Erroneous predictions are highlighted with red bounding boxes.

III-D Limitations

As shown in Fig. 10, SRNet typically fails on small-scale foreground objects. We believe this is mainly because our method cannot capture fine-grained information. In future work, we hope to find more effective methods to distinguish small objects and achieve better performance.

IV Conclusion

In this paper, we tackle the issues of memory-based methods for semi-supervised video object segmentation (VOS). To improve space-time coherence, we first propose to generate an implicit auxiliary frame between adjacent frames. Second, we introduce high-level semantic matching, i.e., prototype-level matching, to reduce undesired mismatching caused by noise or distractors. The experimental results show that our SRNet achieves competitive performance.

V Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant 62332015, and Grant 62072449; in part by the Science and Technology Development Fund, Macau, SAR, under Grant 0075/2023/AMJ, Grant 0003/2023/RIB1, and Grant 001/2024/SKL; in part by the Zhuhai Science and Technology Innovation Bureau under Grant ZH2220004002524; and in part by the University of Macau under Grant MYRG2022-00059-FST and Grant MYRG-GRG2023-00237-FST-UMDF.

References

  • [1] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L.-C. Chen, “Feelvos: Fast end-to-end embedding learning for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9481–9490.
  • [2] L. Fu, Z. Li, Q. Ye, H. Yin, Q. Liu, X. Chen, X. Fan, W. Yang, and G. Yang, “Learning robust discriminant subspace based on joint $\mathrm{L}_{2,p}$- and $\mathrm{L}_{2,s}$-norm distance metrics,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 1, pp. 130–144, 2020.
  • [3] Z. Yang, Y. Wei, and Y. Yang, “Collaborative video object segmentation by foreground-background integration,” in European Conference on Computer Vision.   Springer, 2020, pp. 332–348.
  • [4] S. W. Oh, J.-Y. Lee, N. Xu, and S. J. Kim, “Video object segmentation using space-time memory networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9226–9235.
  • [5] Z. Yang, Y. Wei, and Y. Yang, “Collaborative video object segmentation by multi-scale foreground-background integration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4701–4712, 2021.
  • [6] H. Xie, H. Yao, S. Zhou, S. Zhang, and W. Sun, “Efficient regional memory network for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1286–1295.
  • [7] Q. Huang, L. Shen, R. Zhang, J. Cheng, S. Ding, Z. Zhou, and Y. Wang, “Hdmixer: Hierarchical dependency with extendable patch for multivariate time series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12 608–12 616.
  • [8] H. Seong, S. W. Oh, J.-Y. Lee, S. Lee, S. Lee, and E. Kim, “Hierarchical memory matching network for video object segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 889–12 898.
  • [9] Z. Yang, Y. Wei, and Y. Yang, “Associating objects with transformers for video object segmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 2491–2502, 2021.
  • [10] M. Li, L. Hu, Z. Xiong, B. Zhang, P. Pan, and D. Liu, “Recurrent dynamic embedding for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1332–1341.
  • [11] Y. Chen, D. Zhang, Z.-x. Yang, and E. Wu, “Robust and efficient memory network for video object segmentation,” arXiv preprint arXiv:2304.11840, 2023.
  • [12] Q. Ye, P. Huang, Z. Zhang, Y. Zheng, L. Fu, and W. Yang, “Multiview learning with robust double-sided twin svm,” IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 12 745–12 758, 2021.
  • [13] H. K. Cheng, Y.-W. Tai, and C.-K. Tang, “Rethinking space-time networks with improved memory coverage for efficient video object segmentation,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 781–11 794, 2021.
  • [14] Y. Chen, C. Hao, Z.-X. Yang, and E. Wu, “Fast target-aware learning for few-shot video object segmentation,” Science China Information Sciences, vol. 65, no. 8, p. 182104, 2022.
  • [15] H. K. Cheng and A. G. Schwing, “Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model,” in European Conference on Computer Vision.   Springer, 2022, pp. 640–658.
  • [16] M. Li, L. Hu, Z. Xiong, B. Zhang, P. Pan, and D. Liu, “Recurrent dynamic embedding for video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1332–1341.
  • [17] K. Park, S. Woo, S. W. Oh, I. S. Kweon, and J.-Y. Lee, “Per-clip video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1352–1361.
  • [18] Y. Chen, D. Zhang, Y. Zheng, Z.-X. Yang, E. Wu, and H. Zhao, “Boosting video object segmentation via robust and efficient memory network,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [19] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More deformable, better results,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9308–9316.
  • [20] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang, “Youtube-vos: A large-scale video object segmentation benchmark,” arXiv preprint arXiv:1809.03327, 2018.
  • [21] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmentation,” arXiv preprint arXiv:1704.00675, 2017.