1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

Mingqi Gao^1,4,† **gnan Luo^2,† **yu Yang^1,‡ Jungong Han^3,4 Feng Zheng^1,2,‡
†Equal contributions. ‡Corresponding authors.
¹Tapall.ai ²Southern University of Science and Technology
³University of Sheffield ⁴University of Warwick

Team: Tapall.ai
{mingqi.gao,**yu.yang}@tapall.ai, [email protected]
[email protected], [email protected]

Abstract

Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many new challenges to the field of referring video object segmentation (RVOS). In this technical report, we investigate and validate the effectiveness of static-dominant data and frame sampling in this challenging setting. Our solution achieves a $\mathcal{J}\&\mathcal{F}$ score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge. The code is available at: https://github.com/Tapall-AI/MeViS_Track_Solution_2024.

1 Introduction

Pixel-level Video Understanding in the Wild (PVUW) is a workshop providing large-scale datasets, competitions, and discussions for video scene parsing, one of the fundamental problems in computer vision. Since 2021, PVUW has encouraged much improvement in video semantic segmentation and video panoptic segmentation. This year, two new subjects join PVUW: 1) Complex Video Object Segmentation (MOSE) [7] and 2) Motion Expression guided Video Segmentation (MeViS) [6], supplementing PVUW with object-centric pixel-level video understanding, vital to many real-world applications such as video editing and human-computer interactive systems. With large-scale videos, diverse/realistic challenges, and high-quality annotations, MOSE and MeViS build a competitive platform and encourage comprehensive and robust solutions.

This technical report focuses on the MeViS subject: Motion Expression guided Video Segmentation, which aims to segment target objects in videos, guided by natural language expressions. Before MeViS, several benchmarks [9, 14, 22] were proposed to encourage exploration in this field. Despite fostering a surge of research, these benchmarks focus on short videos with few same-category objects and on static attributes (e.g., location and appearance). As a result, even frame-level segmentation can predict high-quality masks on them, which weakens the investigation of temporal properties that are vital to understanding real-world videos.

Recently, MeViS (Motion expressions Video Segmentation) [6] has been proposed to emphasise temporal properties in RVOS. Compared with previous benchmarks, MeViS brings several unique challenges: 1) Motion-dominant language expressions, 2) Complex scenes with multiple same-category instances, 3) One-to-more text-object pairs, and 4) Long videos. These challenges encourage RVOS to focus on dynamic attributes, comprehensive multi-modal interactions, and efficient long-term video understanding.

The success of MTTR [1] and ReferFormer [25] has motivated the community to adopt transformer-based end-to-end architectures [2, 31] as the dominant paradigm. Given an input video and text, the paradigm encodes object queries from all frames and decodes the text-relevant ones into masks. The difference between MTTR and ReferFormer lies in query encoding: the former encodes general object queries, whereas the latter uses language-guided queries, which inspired most subsequent works. These can be roughly divided into two categories: 1) Robust vision-language alignments [16, 15, 20], which focus on the alignment between visual and textual properties; and 2) Temporal-aware interactions [10, 17, 27], which leverage spatial and temporal properties to segment the targets. The more recent ones, SOC [17] and MUTR [27], achieve SoTA performance on previous benchmarks thanks to their efficient interactions between object sequences and texts. This idea comes from VITA [12], a video instance segmentation method, which also inspired the latest RVOS SoTA: DsHMP [11]. Despite good results, these methods struggle with MeViS since they are trained on static-dominant data. In addition, MeViS consists of long videos, which challenge RVOS methods in both comprehensive video understanding and efficiency.

Given these new and realistic challenges, this report improves existing RVOS methods in their training and inference schemes. Specifically, we take MUTR [27] as the baseline. Starting from weights pre-trained on the Ref-COCO series [30, 19] and Ref-YouTube-VOS [22], we fine-tune the model on MeViS. For one-to-more text-object pairs, the masks of all referred objects are treated as a whole to encourage adaptive object perception based on texts. To balance comprehensive understanding and efficiency, we split long input videos into sub-videos via frame sampling. With these improvements, our solution ranks 1st in the MeViS Track.

Experiments on the MeViS valid set indicate that previous RVOS data still contribute to this challenging setting thanks to their sufficient and well-aligned object masks and texts. In addition, ablations on sampling schemes reveal that there is much room for improvement in temporal modelling over long videos. Specifically, limited by computational resources, the temporal modules are trained with pseudo videos of only a few frames, whereas videos at inference time carry much more temporal context. Because of this inconsistency, feeding the temporal modules fewer (sampled) frames outperforms feeding them all frames. We hope these findings are helpful for future research.

2 Related Works

This section overviews representative methods and trends in referring and semi-supervised video object segmentation.

2.1 Referring Video Object Segmentation

Recent RVOS methods build their end-to-end architectures upon transformers. Specifically, given input videos and texts, the methods initialise fixed-number object queries to integrate vision-language contexts. Queries from different frames are considered a trajectory if they have the same index. The trajectory best matching with texts is decoded into masks on each frame. As pioneer works, MTTR [1] and ReferFormer [25] build foundation architectures with visual/textual encoders, multi-modal transformers, and mask decoders, motivating many following works. They improve RVOS in mainly two aspects: 1) Robust multi-modal alignments and 2) Temporal-aware interactions.
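The trajectory selection described above can be illustrated with a minimal sketch. It is not taken from any of the cited implementations: the function names, the array layouts, and the choice to rank trajectories by their mean per-frame matching score are all assumptions made for illustration.

```python
import numpy as np


def select_trajectory(scores: np.ndarray) -> int:
    """Pick the query index whose trajectory best matches the text.

    scores: (T, Q) array of per-frame text-matching scores for Q queries.
    Queries sharing an index across frames form one trajectory; here the
    trajectory is ranked by its mean score over frames (an assumption).
    """
    return int(scores.mean(axis=0).argmax())


def decode_trajectory(mask_logits: np.ndarray, scores: np.ndarray,
                      threshold: float = 0.0) -> np.ndarray:
    """Return (T, H, W) boolean masks for the best-matching trajectory.

    mask_logits: (T, Q, H, W) per-frame, per-query mask logits
    (hypothetical layout).
    """
    best = select_trajectory(scores)
    return mask_logits[:, best] > threshold
```

With two frames and three queries, for example, scores of `[[0.1, 0.9, 0.2], [0.2, 0.7, 0.6]]` select query 1, whose per-frame logits are then thresholded into masks.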

With cyclic structural consensus, R2VOS [16] shows better results when text-referred objects are absent from frames. In SgMg [20], a spectrum-based multi-modal attention is proposed to improve query-guided mask predictions. FS-RVOS [15] adapts RVOS to new visual/textual concepts via cross-modal affinity. As a versatile model, UNINEXT [26] unifies different object perception tasks and data, generalising well on previous RVOS benchmarks.

Previous methods rarely consider temporal properties or do so only implicitly, e.g., by using a video-swin-transformer as the backbone. In HTML [10], vision and language information interact over hierarchical temporal contexts. Motivated by VITA [12], which validates that video understanding can be achieved by associating frame-level objects, recent methods encode temporal properties only from object queries, leading to end-to-end and efficient architectures: SOC [17], MUTR [27], and DsHMP [11]. The first two emphasise mutual multi-modal fusion and achieve SoTA performance on previous benchmarks, while the last performs hierarchical multi-modal interactions and shows high-quality results on all RVOS benchmarks.

2.2 Semi-supervised Video Object Segmentation

Unlike RVOS, which specifies the target objects with texts, semi-supervised video object segmentation (SVOS) considers manually annotated masks (usually on the first frame) as targets [8]. Therefore, SVOS methods focus on dense correspondence between frames and can propagate high-quality masks from one or several frames to the whole video.

With this feature, most winning solutions [24, 18] from previous RVOS competitions use SVOS methods to refine their results. In brief, they first select high-confidence masks from the overall predictions. Then, the masks are propagated to the remaining frames to refine the corresponding results. The intuition behind the idea is that offline RVOS methods struggle to generate spatio-temporally consistent object masks. This can be mitigated significantly by powerful SVOS methods (provided the selected masks are of high quality).

The memory-based paradigm (since STM [21]) has dominated SVOS due to its efficient, robust, and dense correspondence between frames. In particular, the paradigm considers not only the first-frame annotations but also predictions from intermediate frames as references. This way, SVOS can better adapt to object changes.

Figure 1: Overview of our solution. Given an input video, we divide all frames into $N$ subsets via non-continuous sampling. Here we take two subsets as an example. They are marked with Blue and Green boxes. In particular, each subset is segmented individually, guided by the input text, and combined for the final results.

Earlier SoTAs [23, 4] improve STM with robust cross-frame correspondence. The recent focus has shifted to more challenging and realistic settings, namely long videos and complex scenes, motivating high-quality benchmarks [7, 13] and solutions. Specifically, XMem [3] is proposed to segment long videos with dynamic memory management. The AOT series [28, 29] considers object representations to enhance robustness against complex scenes. By integrating object queries into dense correspondence, Cutie [5] significantly reduces the matching noise between frames and achieves the SoTA SVOS performance.

3 Method

Fig. 1 shows our solution, where we take MUTR [27] as the base model, with Swin-Transformer-Large as the vision encoder and RoBERTa-base as the text encoder. Given an input video with $T$ frames ( $\mathcal{V}=\{v_{t}\in\mathbb{R}^{H\times W\times 3}\}^{T}_{t=1}$ ) and a referring text $\mathcal{E}=\{e_{i}\}^{L}_{i=1}$ with $L$ words, we first sample $\mathcal{V}$ into $N$ subsets: $\{\mathcal{V}_{n}\}_{n=1}^{N}$ . Then, we segment each subset individually under the guidance of $\mathcal{E}$ , obtaining mask subsets: $\{\mathcal{M}_{n}\}_{n=1}^{N}$ . Finally, the masks are combined into the final predictions: $\mathcal{M}=\{m_{t}\in\mathbb{R}^{H\times W}\}_{t=1}^{T}$ .

Training details.

Starting from MUTR’s weights jointly trained on Ref-COCO [30], Ref-COCO+ [30], Ref-COCOg [19], and Ref-YouTube-VOS [22], we perform fine-tuning on the MeViS training videos. For expressions specifying multiple objects, we treat all masks as a whole to encourage the model to perceive and segment all referred objects in the video. To better leverage the pre-trained parameters, we follow MUTR in sampling five frames as a pseudo video and use the same losses. The fine-tuning is performed for two epochs, with the learning rate reduced to 10% during the last one.
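Two of the training choices above lend themselves to a short illustration: merging the masks of all referred objects into one target, and lowering the learning rate in the final epoch. The sketch below is hypothetical rather than the released training code. The function names and array layout are assumptions, and the base learning rate is left as a parameter because the report does not state it.

```python
import numpy as np


def merge_object_masks(masks: np.ndarray) -> np.ndarray:
    """Union the masks of all objects referred to by one expression.

    masks: (K, T, H, W) binary masks for K objects over T frames
    (hypothetical layout). Returns a single (T, H, W) boolean target.
    """
    return np.any(masks.astype(bool), axis=0)


def learning_rate(epoch: int, base_lr: float, num_epochs: int = 2,
                  decay: float = 0.1) -> float:
    """Use base_lr in every epoch except the last, where it drops to 10%.

    epoch is zero-based; base_lr is a placeholder, not a value from the report.
    """
    return base_lr * decay if epoch == num_epochs - 1 else base_lr
```

Under this schedule, a two-epoch run uses `base_lr` in epoch 0 and `0.1 * base_lr` in epoch 1.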

Inference details.

Given an input video, we resize each frame so that its shorter side is 360. Unlike previous RVOS benchmarks, MeViS videos contain many frames and thus cannot be processed in a single feed-forward pass. As diagrammed in Fig. 1, we sample the video into $N=\lceil T/T_{c}\rceil$ subsets and perform referring segmentation on each subset individually, where $T_{c}=30$ is the length of each subset.
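The sampling-and-merging procedure can be sketched as follows. This is an illustrative reading, not the released inference code. It assumes that non-continuous sampling interleaves frames with a stride equal to the number of subsets, that continuous sampling cuts the video into consecutive chunks, and that the number of subsets is rounded up when $T$ is not a multiple of $T_{c}$. `segment_fn` is a hypothetical stand-in for the RVOS model.

```python
import math
from typing import Callable, List

import numpy as np


def non_continuous_subsets(num_frames: int, subset_len: int = 30) -> List[List[int]]:
    """Interleaved frame indices: subset n takes frames n, n + N, n + 2N, ..."""
    n_subsets = math.ceil(num_frames / subset_len)
    return [list(range(start, num_frames, n_subsets)) for start in range(n_subsets)]


def continuous_subsets(num_frames: int, subset_len: int = 30) -> List[List[int]]:
    """Consecutive chunks of at most subset_len frames."""
    return [list(range(start, min(start + subset_len, num_frames)))
            for start in range(0, num_frames, subset_len)]


def segment_video(frames: np.ndarray, text: str,
                  segment_fn: Callable[[np.ndarray, str], np.ndarray],
                  subsets: List[List[int]]) -> np.ndarray:
    """Segment each subset independently and scatter masks back in frame order.

    frames: (T, H, W, 3) video; segment_fn maps a (t, H, W, 3) clip and the
    text to (t, H, W) boolean masks (hypothetical model interface).
    """
    num_frames, height, width = frames.shape[:3]
    masks = np.zeros((num_frames, height, width), dtype=bool)
    for indices in subsets:
        masks[indices] = segment_fn(frames[indices], text)
    return masks
```

For a 7-frame video with a subset length of 3, the interleaved scheme gives the subsets `[0, 3, 6]`, `[1, 4]`, and `[2, 5]`, whereas the consecutive scheme gives `[0, 1, 2]`, `[3, 4, 5]`, and `[6]`. Under the interleaved scheme, each subset still spans almost the whole video.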

4 Experiments

This section first shows our quantitative and qualitative results on the MeViS test set. Then, we provide ablations on the MeViS validation set to show the solution’s effectiveness and to derive some insights for future research.

4.1 Main Results

Tab. 1 shows our quantitative results on the MeViS test set.

Team	$\mathcal{J}\&\mathcal{F}$	$\mathcal{J}$	$\mathcal{F}$
Tapall.ai	0.5447 (1)	0.5048 (2)	0.5846 (1)
BBBiiinnn	0.5420 (2)	0.5097 (1)	0.5743 (2)
PPPPPsanG	0.5280 (3)	0.4853 (3)	0.5707 (3)
times	0.5151 (4)	0.4610 (4)	0.5691 (4)
Phan	0.5075 (5)	0.4562 (5)	0.5588 (5)
LIULINKAI	0.4267 (6)	0.3927 (6)	0.4607 (6)
ntuLC	0.3700 (7)	0.3407 (7)	0.3994 (7)

Table 1: Quantitative results on the MeViS test set.
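As a reading aid for Tab. 1, the sketch below assumes the usual convention in video object segmentation, where $\mathcal{J}$ is the region similarity (mask IoU), $\mathcal{F}$ is the contour accuracy, and $\mathcal{J}\&\mathcal{F}$ is their mean. The function names are hypothetical, the boundary measure $\mathcal{F}$ is not implemented, and returning 1.0 when both masks are empty is an assumed convention. The official evaluation protocol may differ in how scores are aggregated over frames and expressions.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J for one frame: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j: float, f: float) -> float:
    """J&F as the arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Consistency check against the first row of Tab. 1.
print(round(j_and_f(0.5048, 0.5846), 4))  # 0.5447
```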

Method	Backbone	Prev.	MeViS	$\mathcal{J}\&\mathcal{F}$
MUTR [27]	Swin-L	✓	✗	0.4343
MUTR [27]	Swin-L	✓	✓	0.4857
SOC [17]	V-Swin-B	✓	✗	0.4394
SOC [17]	V-Swin-B	✓	✓	0.4664

Table 2: Ablations on training data. ‘Prev.’ indicates the use of Ref-COCO, Ref-COCO+, Ref-COCOg, and Ref-YouTube-VOS.

Method \ Sub-video length	1	5	10	20	30	40
N-continuous	0.4725	0.4788	0.4811	0.4849	0.4857	0.4864
Continuous	0.4725	0.4831	0.4855	0.4875	0.4864	0.4892
No sampling	0.4607	0.4685	0.4730	0.4792	0.4822	0.4732

Table 3: Ablations on sampling methods. Five sub-video lengths are considered. The highlighted results and the results evaluated in Tab. 1 come from the same model.

4.2 Ablations

To validate the effectiveness of our solution, we show results in Fig. 3 and ablations on the MeViS valid set.

Training method.

Tab. 2 compares $\mathcal{J}\&\mathcal{F}$ on different training data. To generalise the conclusion, we take another RVOS method with temporal properties (SOC [17]) into account. MUTR and SOC share the same training and inference procedure. We observe that the training data from previous benchmarks still contribute to this challenging setting, thanks to their sufficient and well-aligned object-text pairs. Fig. 4 shows that previous benchmarks enable RVOS methods to perceive objects. With MeViS and unified mask supervision, the methods can handle more challenging cases, such as motion expressions and expressions with plural nouns.
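For reference, the differences between the paired rows of Tab. 2, which differ only in whether MeViS data are used, work out as follows (plain arithmetic on the reported numbers; the symbol $\Delta$ is introduced here for illustration):

$$
\Delta_{\text{MUTR}} = 0.4857 - 0.4343 = 0.0514, \qquad
\Delta_{\text{SOC}} = 0.4664 - 0.4394 = 0.0270 .
$$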

Sampling method.

Tab. 3 ablates methods and hyper-parameters for sampling frames. The differences between these methods are diagrammed in Fig. 2. Although the temporal modules in MUTR enable us to collect and reason over long-term object queries from videos, they are only trained with five-frame pseudo videos. This gap between the temporal contexts seen in training and in inference hampers temporal interactions over long videos. Results in Tab. 3 and Fig. 5 show that the temporal modules work better than frame-level prediction (sub-video length = 1), but that adding more temporal context yields only marginal further gains.

5 Conclusion

This technical report explores the value of training data and temporal contexts for the challenging MeViS benchmark. The competitive results and ablations demonstrate that well-aligned object-text data (even when dominated by static attributes) are helpful in motion expression-guided referring video segmentation. In addition, we investigate the effectiveness of temporal contexts and reveal room for improvement in the temporal multi-modal analysis of long videos.

References

[1] Adam Botach, Evgenii Zheltonozhskii, and Chaim Baskin. End-to-end referring video object segmentation with multimodal transformers. In CVPR, pages 4985–4995, 2022.
[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[3] Ho Kei Cheng and Alexander G Schwing. XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In ECCV, pages 640–658, 2022.
[4] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Rethinking space-time networks with improved memory coverage for efficient video object segmentation. NeurIPS, 34:11781–11794, 2021.
[5] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. In CVPR, 2024.
[6] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. In ICCV, 2023.
[7] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. In ICCV, 2023.
[8] Mingqi Gao, Feng Zheng, James JQ Yu, Caifeng Shan, Guiguang Ding, and Jungong Han. Deep learning for video object segmentation: a review. Artificial Intelligence Review, 56(1):457–531, 2023.
[9] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In CVPR, pages 5958–5966, 2018.
[10] Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, and Yu Qiao. HTML: Hybrid temporal-scale multimodal learning framework for referring video object segmentation. In ICCV, pages 13414–13423, 2023.
[11] Shuting He and Henghui Ding. Decoupling static and hierarchical motion perception for referring video segmentation. In CVPR, 2024.
[12] Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, and Seon Joo Kim. VITA: Video instance segmentation via object token association. In NeurIPS, pages 23109–23120, 2022.
[13] Lingyi Hong, Zhongying Liu, Wenchao Chen, Chenzhi Tan, Yuang Feng, Xinyu Zhou, Pinxue Guo, **glun Li, Zhaoyu Chen, Shuyong Gao, et al. LVOS: A benchmark for large-scale long-term video object segmentation. arXiv preprint arXiv:2404.19326, 2024.
[14] Anna Khoreva, Anna Rohrbach, and Bernt Schiele. Video object segmentation with language referring expressions. In ACCV, pages 123–141, 2018.
[15] Guanghui Li, Mingqi Gao, Heng Liu, Xiantong Zhen, and Feng Zheng. Learning cross-modal affinity for referring video object segmentation targeting limited samples. In ICCV, pages 2684–2693, 2023.
[16] Xiang Li, **glu Wang, Xu Xiaohao, Li Xiao, Raj Bhiksha, and Lu Yan. Robust referring video object segmentation with cyclic structural consensus. In ICCV, 2023.
[17] Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. SOC: Semantic-assisted object cluster for referring video object segmentation. In NeurIPS, 2023.
[18] Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. 1st place solution for 5th LSVOS challenge: Referring video object segmentation. In ICCV Workshop, 2023.
[19] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, pages 11–20, 2016.
[20] Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi-granularity referring video object segmentation. In ICCV, 2023.
[21] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, pages 9226–9235, 2019.
[22] Seonguk Seo, Joon-Young Lee, and Bohyung Han. URVOS: Unified referring video object segmentation network with a large-scale benchmark. In ECCV, pages 208–223, 2020.
[23] Hongje Seong, Junhyuk Hyun, and Euntai Kim. Kernelized memory network for video object segmentation. In ECCV, pages 629–645, 2020.
[24] Rui Sun, Naisong Luo, Yuan Wang, Yuwen Pan, Huayu Mai, Zhe Zhang, and Tianzhu Zhang. 1st place solution for YouTubeVOS challenge 2022: Video object segmentation. In CVPR Workshop, 2022.
[25] Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and ** Luo. Language as queries for referring video object segmentation. In CVPR, pages 4974–4984, 2022.
[26] Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, ** Luo, Zehuan Yuan, and Huchuan Lu. Universal instance perception as object discovery and retrieval. In CVPR, pages 15325–15336, 2023.
[27] Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei Zhang, Hongyang Li, Yu Qiao, Zhongjiang He, and Peng Gao. Referred by multi-modality: A unified temporal transformer for video object segmentation. arXiv preprint arXiv:2305.16318, 2023.
[28] Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. NeurIPS, 34, 2021.
[29] Zongxin Yang, Jiaxu Miao, Yunchao Wei, Wenguan Wang, Xiaohan Wang, and Yi Yang. Scalable video object segmentation with identification mechanism. IEEE TPAMI, 2024.
[30] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In ECCV, pages 69–85, 2016.
[31] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.