License: arXiv.org perpetual non-exclusive license
arXiv:2103.10702v4 [cs.CV] 19 Jan 2024

ClawCraneNet: Leveraging Object-level Relation for
Text-based Video Segmentation

Chen Liang  Yu Wu  Yawei Luo  Yi Yang
Zhejiang University  Baidu Research  ReLER, University of Technology Sydney
Abstract
Extended version published in [27].

Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language referring expressions. In fact, people usually describe a target object using its relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach that imitates how humans segment an object with language guidance. We first figure out all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin.

1 Introduction

With the significant progress achieved in computer vision and natural language processing, novel tasks requiring a joint understanding of both visual and linguistic modalities have emerged recently, e.g., visual question answering [1, 30, 3], image [50, 1, 43] and video captioning [9, 40], vision-dialog navigation [2, 53], and so on. Inspired by this success, Gavrilyuk et al. [10] introduce the challenging task of text-based video segmentation, which takes a video and a natural language description as inputs and predicts a set of segmentation masks for the referent. Solving this task requires a comprehensive understanding of the visual and linguistic information together with a fine-grained analysis of video contents.

Figure 1: (a) Previous bottom-up methods mainly perform semantic relationship formulation at the pixel level. The corresponding models cannot correctly identify high-level relations based merely on local receptive fields, which directly leads to ambiguous predictions. (b) Our top-down pipeline first performs feature extraction for objects and then models the crucial relation information with high-level perception, leading to better segmentation masks through multi-modal retrieval. Figuratively, we liken the process of retrieving a visual object with a linguistic query to playing a claw crane machine.

Accordingly, several recent works [39, 33, 38, 34] proposed to capitalize on the pixel-level visual-language relationship for video comprehension. These techniques follow a bottom-up paradigm, which mainly focuses on multi-modal feature fusion and pixel-level relation establishment to generate the segmentation masks directly. While some effectiveness has been achieved, these attempts generally lack awareness of object-level information and relations, which is crucial for understanding semantic information, especially for multi-modal comprehension [15, 16]. As illustrated in Figure 1 (a), methods with only low-level perception can formulate just local region-level relationships, which is insufficient for modeling semantic relationships that follow the description logic of natural language. Thus the bottom-up approaches inevitably introduce noisy relationship modeling and inaccurate object comprehension, leading to ambiguous segmentation results.

To tackle the above issue, we propose a top-down pipeline that mimics how humans localize the referent with language guidance, i.e., first finding candidates and then parsing relations. It is partially inspired by the human visual cognitive system, in which humans preferentially direct attention towards meaningful entities (object-oriented) [13, 5, 41] and then extract structured information on how entities relate to or interact with each other (relation-based) [21, 41, 22]. Specifically, by applying an off-the-shelf instance segmentation module to find candidate objects, we can tackle the text-based video segmentation problem in an object-level cross-modal retrieval manner. To the best of our knowledge, this is the first attempt to tackle the text-based video segmentation problem from the top-down view.

We further explore three kinds of object-level relations in our top-down pipeline, i.e., positional relation, semantic relation, and temporal relation.

Firstly, we propose a relative position encoding module to encode spatial information of each candidate object. On the basis of absolute position, we further consider the relative ranked index for each object according to its coordinates. The relative index addresses the spatial relations that are common in natural language descriptions, e.g., “the second guy from the left”.

Secondly, we propose a language-guided object attention module to construct semantic relations, which directly highlights the referring entities. As shown in Figure 1, related objects (e.g., “guy” and “ball”) will exchange information according to responsiveness to relational expressions (e.g., “kicking”). After relation-aware language comprehension, a particular referent feature (e.g., “guy”) would contain rich semantic evidence including relationship (e.g., “kicking a ball”) and attribute (e.g., “in black”).

Thirdly, we investigate the temporal relationship between inter-frame objects with a merge-by-track scheme. In particular, with a multi-object tracking strategy, we perform inter-frame object association based on similarities and build temporally related tracks with the Hungarian algorithm [25]. The final prediction is made according to the average of the confidence scores in each track.

In this way, we obtain a visual embedding with rich individual and mutual information, which facilitates a correct language-to-vision correspondence in the complex video context. Figuratively, these visual objects are like well-packaged dolls displayed in a claw crane machine, and the linguistic description acts as a claw looking for the shiniest doll. Our network explicitly formulates this retrieval pipeline, which is why we name it ClawCraneNet. The main contributions are as follows:

  • We propose a novel top-down pipeline that tackles the text-based video segmentation task in a retrieval manner.

  • We explicitly investigate three kinds of object-level relations to progressively construct discriminative visual embedding, i.e., relative positional relation, cross-object semantic relation, and inter-frame temporal relation.

  • The proposed method significantly outperforms state-of-the-art methods on two popular text-guided video segmentation datasets, i.e., A2D Sentences and J-HMDB Sentences.

2 Related Work

2.1 Referring Image Segmentation

Referring expression segmentation aims at precisely localizing the entity referred to by a natural language expression with a pixel-level segmentation mask. The bottom-up methods [14, 28, 49, 35, 7, 32, 16, 19, 6, 47] mainly construct a multi-modal feature and then generate referring masks after some refinement process. Most state-of-the-art works adopt the structure of the fully convolutional network (FCN) [29] to generate the pixel-level segmentation mask. At first, Hu et al. [14] directly leverage the concatenation of visual and linguistic features from a CNN and an LSTM to construct a multi-modal feature and generate the final mask. Later, several techniques are incorporated into this field, e.g., multi-modal LSTM [28], image-to-word attention [49], dynamic filter [32], and adversarial learning [35] or cycle-consistency [7] between the referring expression and its reconstructed caption. Recently, to explore the relationship between multi-modal features [44] and further model the structural context, Hu et al. [14] propose a bi-directional cross-modal attention module to emphasize visual guidance on linguistic features. Huang et al. [16] utilize a graph-based structure to progressively exploit different types of words in the expression. Different from these works, which focus on low-level feature comprehension, we investigate object-level feature retrieval as an alternative, inspired by the human vision system, i.e., finding the referred objects and then parsing the relations.

2.2 Top-down Text-based Object Grounding

The existing top-down methods [52, 36, 48, 51, 42, 4] mainly leverage a pre-trained detector, e.g., Mask R-CNN [11], to generate object proposals, and then rank the box-level objects according to the similarity score between vision-language embeddings. We follow the same top-down strategy and utilize an off-the-shelf instance segmentation method to perceive candidate objects. However, existing methods mainly treat box-level object feature matching as the main task and consider the segmentation mask as a by-product of the modular comprehension procedure, obtained by simply replacing the output heads. With the object feature constructed on bounding boxes, these methods cannot handle the occlusions among objects in a video, especially in crowded scenes. In this work, we directly model the object feature based on fine-grained segmentation masks to learn more discriminative object features. Additionally, we further explore multi-modal relationship modeling among high-level visual object features.

2.3 Text-based Video Segmentation

Certain success has been achieved in referring image segmentation. Beyond the image domain, the temporal coherence of referred video objects is still waiting to be explored. Recently, Gavrilyuk et al. [10] extend the Actor-Action Dataset (A2D) with human-annotated sentences and introduce the challenging task of actor and action video segmentation from referring expressions. They adopt language-guided dynamic convolution filters to fuse the multi-modal feature. Since then, bottom-up methods have sprung up. Wang et al. [39] utilize asymmetric attention mechanisms to facilitate visually guided linguistic feature learning. Later, they [38] extend vanilla dynamic convolution with a context-modulated dynamic convolution kernel. Ning et al. [34] convert spatial relations to terms of direction and range for better linguistic spatial formulation. McIntosh et al. [33] introduce a capsule-based approach for better capturing the relationship between multi-modal features. To further mine continuous temporal information, they extend the A2D dataset with annotations for all frames. Hui et al. [18] introduce an additional 2D spatial encoder to alleviate the intrinsic spatial misalignment problem in 3D CNNs.

In this work, with the concrete objects obtained with the instance segmentation module, we explicitly exploit temporal coherence among all frames in a video including annotated key-frames and unlabeled frames.

Figure 2: The framework of our proposed ClawCraneNet. As a top-down pipeline, objects are first perceived by an off-the-shelf instance segmentation module, and then selected by finding the best visual-semantic match. During this process, we propagate information among objects through three kinds of relation formulation modules, i.e., positional relation, text-guided semantic relation, and temporal relation. We then utilize the linguistic embedding to retrieve the final prediction.

3 Methods

3.1 Top-Down Pipeline

Previous work [39, 33, 38, 34] tackles the text-based video segmentation task in a bottom-up, pixel-level way. In contrast, we view it as a cross-modal retrieval problem and decompose the task into two stages. The first stage finds all potential candidate objects and their masks in the video, and the second propagates information among objects and selects the best-matched candidate given the referring sentence.

Suppose a video $V$ has $M$ frames, where each frame $f_i$ contains $k_i$ candidate objects $\{x_i^1, x_i^2, ..., x_i^{k_i}\}$. Our target is to retrieve a set of referring objects $T^* = \{x_1^*, x_2^*, ..., x_M^*\}$ in a video by a natural language referring sentence $L$. The sentence $L$ is composed of a sequence of words $(w_1, w_2, ..., w_{N_l})$, where $N_l$ is the length of the input sentence. The task is to use the language description $L$ to retrieve the target object track $T^*$ from a video.

Linguistic Embedding Construction. The language expression $L$ is first processed via a bi-LSTM [17], where the hidden states $\{h_1, h_2, \cdots, h_{N_l}\}$ are further encoded by a self-guided attention module. In particular, the linguistic embedding $\mathcal{L}$ can be obtained by:

$$\mathcal{L} = \texttt{MLP}\Big(\sum_{i=1}^{N_l} \alpha_i h_i\Big), \tag{1}$$

where $\alpha$ denotes the word-level attention weights, calculated by $\alpha_i = \texttt{softmax}(\texttt{fc}(h_i))$, and MLP denotes the multi-layer perceptron. The self-guided attention module introduces a flexible way for the language encoder to focus on keywords and reduces the negative impact caused by sentence truncation or padding. In the following, we describe our top-down pipeline for this task.
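As a rough illustration of the attention pooling inside Eq. 1, the following sketch computes the weights $\alpha_i$ and the weighted sum of hidden states in plain Python. It is a minimal sketch under assumptions, not the authors' implementation: the softmax is assumed to normalize over the words of the sentence, the `fc` layer is reduced to a single weight vector, the outer MLP is omitted, and the names `attention_pool`, `fc_weight`, and `fc_bias` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, fc_weight, fc_bias=0.0):
    """Return (alpha, pooled): alpha_i = softmax_i(fc(h_i)), pooled = sum_i alpha_i * h_i.

    hidden_states: list of N_l vectors (lists of floats), one per word.
    fc_weight:     weight vector of a 1-output linear layer (hypothetical stand-in for fc).
    """
    scores = [sum(w * x for w, x in zip(fc_weight, h)) + fc_bias
              for h in hidden_states]
    alpha = softmax(scores)  # assumed: normalized across words
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden_states))
              for d in range(dim)]
    return alpha, pooled

# Toy input (made-up values): the last row mimics an uninformative padding token.
hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
alpha, pooled = attention_pool(hidden, fc_weight=[2.0, 2.0])
```

In this toy case the padding-like token receives the smallest weight, which matches the stated motivation of reducing the impact of padding.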

Mask Out Foreground Objects. To localize objects in videos, we first build an instance segmentation model by considering the visual content only. Specifically, we use CondInst [37] as our backbone, and train the model using all the object masks. Then we apply the instance segmentation model to each frame to detect and segment all the foreground objects as candidates. Note that we do not exploit additional data/annotations via the instance segmentation model. Denote the object segmentation masks in frame $f_i$ as $\{o^j\}_{j=1}^{N_v}$, where $N_v$ is the number of candidates.

Individual Object Features. We then obtain the $j$-th individual object feature $v^j$ by max-pooling on the mask-cropped feature map extracted from the visual CNN model. Formally, the process can be written as,

$$v^j = \texttt{MLP}(\texttt{Max}(F_v \odot o^j)), \tag{2}$$

where $\odot$ is element-wise multiplication, Max stands for global max pooling, and $F_v$ denotes the feature map of the entire frame. During training, the CNN model is updated in an end-to-end manner. Via Eq. 2, we build the individual object feature $v^j$ by applying its instance segmentation mask $o^j$ on top of the CNN feature map.
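The masked pooling in Eq. 2 can be sketched as follows. This is an illustrative sketch only: the feature map is represented as nested Python lists of shape C x H x W, the mask is assumed to be binary and already resized to the feature-map resolution, the MLP is omitted, and the name `masked_max_pool` is hypothetical.

```python
def masked_max_pool(feature_map, mask):
    """Per-channel global max of (feature_map * mask).

    feature_map: C x H x W nested lists of floats.
    mask:        H x W nested lists of 0/1 (assumed binary, same resolution).
    Returns a C-dimensional list.
    """
    pooled = []
    for channel in feature_map:
        values = [channel[r][c] * mask[r][c]
                  for r in range(len(mask))
                  for c in range(len(mask[0]))]
        # Masked-out positions contribute 0, so the result is never below 0
        # when the mask excludes at least one position.
        pooled.append(max(values))
    return pooled

feature_map = [
    [[1.0, 5.0], [2.0, 3.0]],   # channel 0
    [[4.0, 0.5], [0.2, 0.1]],   # channel 1
]
mask = [[0, 0], [1, 1]]          # the object covers the bottom row only
v = masked_max_pool(feature_map, mask)
```

Note that the large activations outside the mask (5.0 and 4.0) are ignored, so the pooled vector reflects only the region of the candidate object.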

Find Visual-Linguistic Match. From the top-down perspective, the final target is to select the best match among all candidate objects given the language input. Thus we train our model to maximize the matching score between the referred object track and the language representation by,

$$T^* = \operatorname*{arg\,max} \sum_{i=1}^{M} S(v_i^*, \mathcal{L}). \tag{3}$$

Therefore, the core of the problem is to learn a proper visual object embedding that distinguishes the target objects from others. In the later sections, details about how to learn discriminative multi-modal embeddings are introduced.
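To make the retrieval view of Eq. 3 concrete, the sketch below scores each candidate track against a text embedding and returns the best one. Several things here are assumptions rather than details from the text: the matching score $S$ is taken to be cosine similarity, scores are averaged rather than summed so that tracks of different lengths are comparable (for tracks covering all $M$ frames the two give the same arg max), and the names `select_track` and `tracks` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_track(tracks, text_embedding):
    """Pick the track whose objects best match the text on average.

    tracks: dict mapping a track id to a list of per-frame object embeddings.
    Returns (best_track_id, scores_by_track).
    """
    scores = {
        tid: sum(cosine(v, text_embedding) for v in embs) / len(embs)
        for tid, embs in tracks.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (made-up values) for two tracks over two frames.
tracks = {
    "track_a": [[1.0, 0.0], [0.9, 0.1]],
    "track_b": [[0.0, 1.0], [0.2, 0.8]],
}
best, scores = select_track(tracks, text_embedding=[0.0, 1.0])
```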

3.2 Object-level Visual Embedding Construction

As many entities exist in a visual scene, semantic information describing entity categories alone is not enough to distinguish them. Therefore, it is natural to propagate information among all the candidates to facilitate the retrieval process. In this section, on the basis of the individual object representation, we further leverage three kinds of object-level relations, i.e., positional relation, text-guided semantic relation, and inter-frame temporal relation.

Positional Relation Module. Positional information is crucial in depicting an object in images/videos. We design a Positional Relation Module (PRM) to encode the object-level spatial information with $p_i = (x_{min}^i, y_{min}^i, x_{max}^i, y_{max}^i, x_c^i, y_c^i, w_i, h_i, r_i^x, r_i^y)$, where $(x_{min}^i, y_{min}^i)$, $(x_{max}^i, y_{max}^i)$, $(x_c^i, y_c^i)$, $w_i$ and $h_i$ are the normalized top-left coordinates, bottom-right coordinates, center coordinates, width and height of the smallest circumscribed box of segment $o_i$, respectively. The last two dimensions $r_i^x$ and $r_i^y$ are the normalized relative position indices according to the x-axis and y-axis coordinates. Then, the spatially enhanced object feature $\mathcal{V}_i$ is calculated as:

$$\mathcal{V}_i = v_i + W_p(p_i), \tag{4}$$

where $W_p$ is a learnable matrix. With the explicit modeling of relative position information, our network gains the ability to handle referring expressions like “the second from the left”, which are hard for low-level networks to infer with pixel-level spatial encoding only. We show by experiments that even with such slight guidance, our model does learn to comprehend relative spatial descriptions. During the training phase, to make the model more position-aware, we randomly flip the frame image horizontally and swap the corresponding directional words in the textual description (e.g., changing “right” to “left”).
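The 10-dimensional position vector $p_i$ can be sketched as below. This is a sketch under stated assumptions: the text does not say which coordinate the relative index is ranked on or how it is normalized, so the code ranks boxes by their center coordinates and maps ranks to $[0, 1]$; the function name `position_vectors` is hypothetical, and the learnable projection $W_p$ of Eq. 4 is not included.

```python
def position_vectors(boxes, width, height):
    """Build (x_min, y_min, x_max, y_max, x_c, y_c, w, h, r_x, r_y) per box.

    boxes: list of (x_min, y_min, x_max, y_max) in pixels, i.e. the tightest
           box around each segment. All outputs are normalized.
    """
    n = len(boxes)
    centers = [((x0 + x1) / 2.0, (y0 + y1) / 2.0) for x0, y0, x1, y1 in boxes]
    # Assumed ranking rule: sort by center x (left to right) and center y (top to bottom).
    order_x = sorted(range(n), key=lambda i: centers[i][0])
    order_y = sorted(range(n), key=lambda i: centers[i][1])
    rank_x = {idx: r for r, idx in enumerate(order_x)}
    rank_y = {idx: r for r, idx in enumerate(order_y)}
    denom = max(n - 1, 1)  # assumed normalization of ranks to [0, 1]
    vectors = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        xc, yc = centers[i]
        vectors.append((
            x0 / width, y0 / height, x1 / width, y1 / height,
            xc / width, yc / height,
            (x1 - x0) / width, (y1 - y0) / height,
            rank_x[i] / denom, rank_y[i] / denom,
        ))
    return vectors

# Three made-up boxes in a 100 x 100 frame.
boxes = [(60, 10, 80, 50), (0, 20, 20, 60), (30, 0, 50, 40)]
p = position_vectors(boxes, width=100, height=100)
```

With three objects, the box at index 2 gets $r^x = 0.5$, which is the value a model could associate with a phrase such as “the second from the left”.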

Text-guided Semantic Relation Module. To formulate intra-frame object-level relationships, a simple idea is to employ the commonly used vanilla attention module as a relation formulator. However, vanilla attention exchanges information naively based on feature similarity, which is hard to distinguish from within-modal similarity. To diminish the gap, we introduce text guidance into the semantic relation module. Taking the referring example in Figure 1, compared with directly forming the relationship between “Guy” and “Ball”, it is easier for the network to infer the “Ball” if it already knows that the “Guy” is “Kicking” something. Based on this motivation, we devise a Text-guided Semantic Relation Module (TSRM) to leverage relational expressions in the language description. As illustrated in Figure 3, TSRM takes the concatenated object features $f_V = (\mathcal{V}_1, \mathcal{V}_2, ..., \mathcal{V}_{N_v})$ and the sentence $L$ as inputs. TSRM first learns self-guided weights to represent a relationship-aware linguistic feature $f_t$, which is then fused with the object features into a multi-modal feature $f_o$ by,

$$f_o = W_o(\texttt{Concat}(f_V, f_t)), \tag{5}$$

where $\texttt{Concat}(\cdot,\cdot)$ represents the concatenation operation along the channel axis and $W_o$ is a learnable matrix. Next, in the relation stage, the original visual feature is used to query the multi-modal feature. Given the query $f_q = W_q f_V$, the key $f_k = W_k f_o$, and the value $f_v = W_v f_o$, the process to get the text-guided object features $F_o = (\mathcal{V}_1^r, \mathcal{V}_2^r, ..., \mathcal{V}_{N_v}^r)$ can be formulated as,

$$F_o = f_V + \texttt{softmax}\Big(\frac{f_q f_k^T}{\sqrt{d_k}}\Big) f_v, \tag{6}$$

where $\mathcal{V}_i^r$ is the relation-enhanced object feature and $d_k$ is the channel dimension of $f_k$. During this procedure, each object can gain query-focused global context information, especially when there is a strong response between the visual object and the linguistic description.
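The data flow of Eq. 5 and Eq. 6 can be traced with a small pure-Python sketch. It rests on several assumptions that the text does not spell out: each object feature is stored as a row vector, so projections are applied on the right (the transpose of the $W f$ notation); the text feature $f_t$ is copied to every object before concatenation; and the softmax normalizes over the keys. The names `tsrm`, `matmul`, and the toy weights are hypothetical.

```python
import math

def matmul(a, b):
    """Matrix product of nested lists a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Row-wise numerically stable softmax."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def tsrm(f_V, f_t, W_o, W_q, W_k, W_v):
    """f_V: N_v x d object features; f_t: text feature of length d_t."""
    # Eq. 5: concatenate the text feature to every object along the channel axis.
    f_o = matmul([row + f_t for row in f_V], W_o)
    f_q, f_k, f_v = matmul(f_V, W_q), matmul(f_o, W_k), matmul(f_o, W_v)
    d_k = len(f_k[0])
    logits = [[x / math.sqrt(d_k) for x in row]
              for row in matmul(f_q, transpose(f_k))]
    attn = softmax_rows(logits)  # assumed: normalized over keys
    update = matmul(attn, f_v)
    # Eq. 6: residual connection onto the original visual features.
    F_o = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(f_V, update)]
    return F_o, attn

# Toy setup (made-up values): 2 objects, d = 2, d_t = 1, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
f_V = [[1.0, 0.0], [0.0, 1.0]]
f_t = [1.0]
W_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # maps d + d_t = 3 channels back to d = 2
F_o, attn = tsrm(f_V, f_t, W_o, identity, identity, identity)
```

The key difference from vanilla self-attention is that keys and values are computed from the text-fused feature $f_o$, while queries still come from the visual feature $f_V$.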

Figure 3: Illustration of the text-guided semantic relation module. $\otimes$: matrix multiplication; ⓒ: matrix concatenation. The three boxes with different colors stand for three different objects. The self-guided linguistic context $L$ is used as guidance to infer the relationship between the object features $f_V$.

Temporal Relation Module. To deal with blurry or complicated scenes in a video, a natural idea is to borrow the confident judgment from a clear scene to tackle hard scenes. In this part, we employ a tracking-based strategy for this purpose. In particular, cross-frame objects are associated based on visual similarities to form a track. Given any two objects ($x_i$, $x_j$) from adjacent frames, the association similarity $\mathcal{S}_s$ is formulated as follows:

$$\mathcal{S}_s(x_i, x_j) = \mathcal{S}_c(\mathcal{V}_i^r, \mathcal{V}_j^r) + \alpha * U(x_i, x_j), \tag{7}$$

where $\mathcal{S}_c$ denotes the cosine similarity and $U$ represents the mask IoU. Following the multi-object tracking strategy in [46], we maintain several active tracks initialized from the first frame by treating each candidate object as an exclusive track. For each frame, we compute the visual similarity between all active tracks and all candidate object embeddings in the current frame according to Eq. 7. An association is allowed only when the visual similarity is greater than a threshold $\gamma$, and an active track is ended if it is not updated for $\beta$ matching rounds. The Hungarian algorithm [25] is applied to perform multi-object matching. After this procedure, unassigned segments start new tracks, and the object association is repeated until the final frame. Since the target object may appear or disappear in intermediate frames, it is reasonable to allow intermittent tracks.
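One association round of this tracking procedure might look like the sketch below. It is illustrative only and makes several simplifications that are not from the text: masks are represented as sets of pixel coordinates; the optimal one-to-one assignment is found by brute-force enumeration, which stands in for the Hungarian algorithm and is practical only for a handful of objects (a library routine such as `scipy.optimize.linear_sum_assignment` would normally be used); the threshold $\gamma$ is applied after the assignment; and the track-termination rule with $\beta$ is not modeled. The names `associate`, `similarity`, and the default values of `alpha` and `gamma` are hypothetical.

```python
import math
from itertools import permutations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def similarity(obj_a, obj_b, alpha=1.0):
    """Eq. 7: cosine similarity of embeddings plus alpha times mask IoU."""
    return cosine(obj_a["emb"], obj_b["emb"]) + alpha * mask_iou(obj_a["mask"], obj_b["mask"])

def associate(tracks, candidates, alpha=1.0, gamma=0.5):
    """Return (matches, unassigned): matches maps track index -> candidate index."""
    sim = [[similarity(t, c, alpha) for c in candidates] for t in tracks]
    n_t, n_c = len(tracks), len(candidates)
    # Brute-force stand-in for the Hungarian algorithm: enumerate all
    # one-to-one assignments and keep the one with the highest total similarity.
    if n_t <= n_c:
        options = ([(i, p[i]) for i in range(n_t)] for p in permutations(range(n_c), n_t))
    else:
        options = ([(p[j], j) for j in range(n_c)] for p in permutations(range(n_t), n_c))
    best_total, best_pairs = float("-inf"), []
    for pairs in options:
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    # Keep only associations above the threshold; the rest would start new tracks.
    matches = {i: j for i, j in best_pairs if sim[i][j] > gamma}
    unassigned = [j for j in range(n_c) if j not in matches.values()]
    return matches, unassigned

# Toy objects (made-up embeddings and masks): the last tracked object of each
# active track, and three candidates in the current frame.
tracks = [
    {"emb": [1.0, 0.0], "mask": {(0, 0), (0, 1)}},
    {"emb": [0.0, 1.0], "mask": {(5, 5), (5, 6)}},
]
candidates = [
    {"emb": [0.1, 0.9], "mask": {(5, 5), (5, 6), (5, 7)}},
    {"emb": [0.9, 0.1], "mask": {(0, 1), (0, 2)}},
    {"emb": [-1.0, -1.0], "mask": {(9, 9)}},
]
matches, unassigned = associate(tracks, candidates)
```

Here the third candidate matches no active track, so under the procedure described above it would start a new track.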

| Methods | Overlap P@0.5 | Overlap P@0.6 | Overlap P@0.7 | Overlap P@0.8 | Overlap P@0.9 | mAP 0.5:0.95 | IoU Overall | IoU Mean | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Hu et al. [14] ECCV16 | 34.8 | 23.6 | 13.3 | 3.3 | 0.1 | 13.2 | 47.4 | 35.0 | - |
| Li et al. [26] CVPR17 | 38.7 | 29.0 | 17.5 | 6.6 | 0.1 | 16.3 | 51.5 | 35.4 | - |
| Gavrilyuk et al. [10] CVPR18 | 53.8 | 43.7 | 31.8 | 17.1 | 2.1 | 26.9 | 57.4 | 48.1 | - |
| Wang et al. [39] ICCV19 | 55.7 | 45.9 | 31.9 | 16.0 | 2.0 | 27.4 | 60.1 | 49.0 | 8.64 |
| McIntosh et al. [33] CVPR20 | 52.6 | 45.0 | 34.5 | 20.7 | 3.6 | 30.3 | 56.8 | 46.0 | - |
| Wang et al. [38] AAAI20 | 60.7 | 52.5 | 40.5 | 23.5 | 4.5 | 33.3 | 62.3 | 53.1 | 7.18 |
| Ning et al. [34] IJCAI20 | 63.4 | 57.9 | 48.3 | 32.2 | 8.3 | 38.8 | 66.1 | 52.9 | 5.42 |
| Ours | 70.4 | 67.7 | 61.7 | 48.9 | 17.1 | 49.4 | 63.1 | 59.9 | 9.27 |

Table 1: Comparison with state-of-the-art methods on the A2D Sentences using IoU and Precision@K as metrics.

| Methods | Overlap P@0.5 | Overlap P@0.6 | Overlap P@0.7 | Overlap P@0.8 | Overlap P@0.9 | mAP 0.5:0.95 | IoU Overall | IoU Mean |
|---|---|---|---|---|---|---|---|---|
| Hu et al. [14] ECCV16 | 63.3 | 35.0 | 8.5 | 0.2 | 0.0 | 17.8 | 54.6 | 52.8 |
| Li et al. [26] CVPR17 | 57.8 | 33.5 | 10.3 | 0.6 | 0.0 | 17.3 | 52.9 | 49.1 |
| Gavrilyuk et al. [10] CVPR18 | 71.2 | 51.8 | 26.4 | 3.0 | 0.0 | 26.7 | 55.5 | 57.0 |
| Wang et al. [39] ICCV19 | 75.6 | 56.4 | 28.7 | 3.4 | 0.0 | 28.9 | 57.6 | 58.4 |
| McIntosh et al. [33] CVPR20 | 67.7 | 51.3 | 28.3 | 5.1 | 0.0 | 26.1 | 53.5 | 55.0 |
| Wang et al. [38] AAAI20 | 74.2 | 58.7 | 31.6 | 4.7 | 0.0 | 30.1 | 55.4 | 57.6 |
| Ning et al. [34] IJCAI20 | 69.1 | 57.2 | 31.9 | 6.0 | 0.1 | 29.4 | - | - |
| Ours | 88.0 | 79.6 | 56.6 | 14.7 | 0.2 | 43.3 | 64.4 | 65.5 |

Table 2: Comparison with state-of-the-art methods on the J-HMDB Sentences, using the best model trained on A2D Sentences without fine-tuning.

3.3 Training and Inference

Once the comprehension procedure is complete, the final set of visual embeddings $\{\mathcal{V}_1^r, \mathcal{V}_2^r, \cdots, \mathcal{V}_{N_v}^r\}$ and the linguistic embedding $\mathcal{L}$ are obtained. We calculate the cosine similarity between the linguistic embedding and each visual candidate, ${\mathcal{V}_i^r}^{\texttt{T}}\mathcal{L}$. Here we enforce all vectors to be L2-normalized feature embeddings, i.e., $||\mathcal{V}_i^r|| = 1$, $||\mathcal{L}|| = 1$. We adopt the contrastive learning loss for optimizing the model,

$$s_i = \frac{\exp({\mathcal{V}_i^r}^{\texttt{T}}\mathcal{L}/\tau)}{\sum_{j=1}^{N_v}\exp({\mathcal{V}_j^r}^{\texttt{T}}\mathcal{L}/\tau)}, \tag{8}$$
$$\text{loss} = -\log(s_{gt}), \tag{9}$$

where $s_{gt}$ is the matching score of the ground-truth object and $\tau$ is a temperature parameter that controls the concentration level of the distribution. A higher $\tau$ leads to a softer probability distribution. We set $\tau = 0.1$ in our experiments.
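As an illustration of Eqs. 8 and 9, the sketch below computes the matching scores and the loss for a handful of toy embeddings. It is a plain-Python approximation for exposition only: the function names (`l2_normalize`, `matching_scores`, `contrastive_loss`) and the toy vectors are hypothetical, and embeddings are assumed to be non-zero so that normalization is well defined.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def matching_scores(visual_embs, ling_emb, tau=0.1):
    """Eq. 8: softmax over cosine similarities divided by the temperature tau."""
    lang = l2_normalize(ling_emb)
    logits = [
        sum(a * b for a, b in zip(l2_normalize(v), lang)) / tau
        for v in visual_embs
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(visual_embs, ling_emb, gt_index, tau=0.1):
    """Eq. 9: negative log of the ground-truth candidate's matching score."""
    return -math.log(matching_scores(visual_embs, ling_emb, tau)[gt_index])


# Three toy candidates; the first one is aligned with the sentence embedding.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence = [2.0, 0.0]
sharp = matching_scores(visual, sentence, tau=0.1)
soft = matching_scores(visual, sentence, tau=1.0)
```

Comparing `sharp` and `soft` shows the effect described above: with the larger temperature, probability mass is spread more evenly across the candidates.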

During the inference phase, our network first extracts multi-modal embeddings for each frame. Then, the Temporal Relation Module is applied to obtain the candidate tracks. The final track is retrieved by choosing the candidate track that contains the candidate with the highest matching score.
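The retrieval step can be sketched as follows. This reflects one reading of the rule above, namely that a track is ranked by its single best-matching candidate; the data layout (a dictionary from track id to per-frame matching scores) and the name `retrieve_track` are assumptions made for illustration.

```python
def retrieve_track(tracks):
    """Return the id of the track whose best candidate has the highest score.

    `tracks` maps a track id to a non-empty list of (frame_index, score)
    pairs, where score is the matching score of that frame's candidate.
    """
    return max(tracks, key=lambda tid: max(score for _, score in tracks[tid]))


# Toy example: track "b" holds the most confident single match (frame 2).
toy_tracks = {
    "a": [(0, 0.40), (1, 0.55), (2, 0.50)],
    "b": [(0, 0.30), (2, 0.90)],  # intermittent track, missing frame 1
    "c": [(1, 0.60)],
}
selected = retrieve_track(toy_tracks)
```

Under this reading, a confident match in one clear frame decides the result for the harder frames of the same track.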

In addition, as a prerequisite for the object association step, visual embeddings belonging to the same object are implicitly pulled together, since they are all expected to be close to the same linguistic embedding. Further explanation is provided in the supplementary materials.

| Methods | Overlap P@0.5 | Overlap P@0.6 | Overlap P@0.7 | Overlap P@0.8 | Overlap P@0.9 | mAP 0.5:0.95 | IoU Overall | IoU Mean |
|---|---|---|---|---|---|---|---|---|
| Top-Down Pipeline | 64.4 | 62.1 | 56.9 | 45.3 | 16.4 | 45.6 | 58.4 | 55.1 |
| + Absolute PE | 66.8 | 64.2 | 58.7 | 46.9 | 16.4 | 47.0 | 60.6 | 56.9 |
| + Relative PE | 66.3 | 63.9 | 60.2 | 46.9 | 16.3 | 46.9 | 60.2 | 56.7 |
| + PRM | 67.2 | 64.5 | 58.9 | 47.2 | 16.4 | 47.2 | 61.1 | 57.4 |
| + PRM + Vanilla-attention | 67.5 | 64.6 | 59.2 | 47.3 | 16.6 | 47.3 | 61.0 | 57.6 |
| + PRM + TSRM | 68.6 | 66.0 | 60.2 | 48.1 | 16.8 | 48.3 | 62.3 | 58.6 |
| + PRM + TSRM + TRM | 70.4 | 67.7 | 61.7 | 48.9 | 17.1 | 49.4 | 63.1 | 59.9 |

Table 3: Ablation studies on A2D Sentences. PE indicates Position Encoding. Positional Relation Module, Text-guided Semantic Relation Module, and Temporal Relation Module are abbreviated as “PRM”, “TSRM”, and “TRM”, respectively.

4 Experiments

4.1 Datasets and Evaluation Criteria

We conduct our experiments on two extended datasets: A2D Sentences and J-HMDB Sentences. These datasets were released in [10], which adds natural language descriptions to the original A2D [45] and J-HMDB [20] datasets, respectively. A2D Sentences contains 3,782 videos in total, with 8 action classes performed by 7 actor classes. Each video in A2D has 3 to 5 frames annotated with pixel-level actor-action segmentation masks. In addition, it contains 6,655 sentences corresponding to actors and their actions. Following the settings in [39], we split the whole dataset into 3,017 training videos, 737 testing videos, and 28 unlabeled videos. J-HMDB Sentences contains 928 short videos with 928 corresponding sentences describing 21 different action classes. Pixel-wise 2D articulated human puppet masks are provided for evaluating segmentation performance.

The proposed method is evaluated with Intersection-over-Union (IoU) and precision. The overall IoU is the ratio of the total intersection area to the total union area over all testing samples. The mean IoU is the average IoU over all samples, which treats samples of different sizes equally. We also measure precision@K, the percentage of testing samples whose IoU scores are higher than a threshold K, at 5 different IoU thresholds, and calculate the mean average precision over 0.50:0.05:0.95 [10].
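The difference between these metrics is easiest to see on a small example. The sketch below is illustrative only: the function name `evaluate` and the (intersection, union) input format are assumptions, and mAP is approximated as the mean of precision@K over the ten thresholds 0.50, 0.55, ..., 0.95, which may differ in detail from the benchmark's official evaluation code.

```python
def evaluate(samples):
    """Compute overall IoU, mean IoU, precision@K and a simple mAP.

    `samples` is a list of (intersection_area, union_area) pairs, one per
    testing sample, with areas measured in pixels.
    """
    ious = [inter / union if union else 0.0 for inter, union in samples]
    overall_iou = sum(i for i, _ in samples) / sum(u for _, u in samples)
    mean_iou = sum(ious) / len(ious)

    def precision_at(k):
        # Fraction of samples whose IoU is strictly higher than threshold k.
        return sum(iou > k for iou in ious) / len(ious)

    precision = {k: precision_at(k) for k in (0.5, 0.6, 0.7, 0.8, 0.9)}
    thresholds = [(50 + 5 * t) / 100 for t in range(10)]  # 0.50 ... 0.95
    m_ap = sum(precision_at(k) for k in thresholds) / len(thresholds)
    return overall_iou, mean_iou, precision, m_ap


# One large object segmented well and one small object segmented poorly.
overall, mean, prec, m_ap = evaluate([(900, 1000), (10, 100)])
```

Here the overall IoU is about 0.83 while the mean IoU is 0.5, because the large object dominates the summed areas even though the small object is almost entirely missed.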

4.2 Implementation Details

Our network is built on a one-stage instance segmentation method named CondInst [37] for balanced performance and speed. It could be replaced with any other instance segmentation network. This model is initialized from ResNet101 [12] pre-trained on ImageNet [8] and further trained exclusively on A2D [45]. Note that we do not leverage any additional data/annotations when building the instance segmentation module.

For visual and linguistic feature extraction, we adopt a ResNet50 [12] model pre-trained on ImageNet [8] as the visual backbone and a bi-LSTM [17] as the text encoder. All input frames are resized to $320 \times 320$. Following the settings in [10], the maximum sentence length is set to 20 and the dimension of the word vectors is 1000. We employ the hidden states of the bi-LSTM [17] as sentence features with a dimension of 2000. The word embeddings are initialized with one-hot vectors without any pre-trained weights. The cross-frame entity association threshold $\gamma$ is set to $0.8$ by default. Training uses the Adam optimizer [24] with an initial learning rate of $0.0001$ and a scheduler that waits for $2$ epochs after loss stagnation before reducing the learning rate by a factor of $10$. The batch size is $16$.
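The learning-rate schedule described above can be sketched in plain Python as follows. The class name `PlateauScheduler` is hypothetical, and the exact meaning of "waits for 2 epochs" is an assumption: here the rate is reduced once the loss has failed to improve for more than `patience` consecutive epochs, which mirrors common reduce-on-plateau schedulers but is not confirmed by the text.

```python
class PlateauScheduler:
    """Reduce the learning rate by `factor` when the loss stops improving."""

    def __init__(self, lr=1e-4, patience=2, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                # Stagnation lasted longer than the patience window.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# The loss improves once and then stagnates for three epochs.
scheduler = PlateauScheduler()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
```

With these toy losses the rate stays at 0.0001 for the first four epochs and drops to 0.00001 after the third stagnant epoch.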

4.3 Comparison with State-of-the-Art Methods

We compare our ClawCraneNet with other state-of-the-art text-based video segmentation models, following the settings in [10], on the two datasets, i.e., A2D Sentences and J-HMDB Sentences. The comparison results are reported in Table 1 and Table 2. On A2D Sentences, we evaluate [14, 26] pre-trained on the ReferIt dataset [23] and then fine-tuned on A2D Sentences. Other methods, including ours, are trained on A2D Sentences exclusively. As shown in Table 1, with the help of object-level relation comprehension, our approach achieves state-of-the-art performance on most metrics by a remarkable margin, especially at higher IoU thresholds. On P@0.8, our method outperforms the SOTA by a large margin of 16.7%. Moreover, we bring a 6.8% improvement in mean IoU and a 10.6% improvement in mAP (49.4 vs. 38.8) over the SOTA, which directly demonstrates the effectiveness of our method. Despite this clear gain in mean IoU, our overall IoU is lower than that of the best previous method. Because overall IoU favors large objects, it is less sensitive to smaller objects, which are crucial for reflecting model performance. This suggests that ClawCraneNet not only captures obvious larger objects but also learns the object-level semantic context needed to distinguish small objects. In addition, our method is efficient (high FPS) compared to bottom-up methods.

On J-HMDB Sentences, for fair comparison, we follow the setting in [39, 38, 33, 10] and evaluate our model pre-trained on A2D Sentences without any additional fine-tuning, the same protocol as the compared methods. Our approach significantly outperforms previous state-of-the-art methods on all metrics considered. For the low P@0.9 result, one possible reason is that the ground-truth masks in J-HMDB Sentences are generated from puppets, so it is hard for a model trained on precise segmentation masks to fit this data distribution.

| Methods | Backbone | mAP 0.5:0.95 | IoU Overall | IoU Mean |
|---|---|---|---|---|
| CondInst | R-101-FPN | 49.4 | 63.1 | 59.9 |
| CondInst | R-50-FPN | 48.3 | 62.7 | 59.4 |
| Mask R-CNN | R-50-FPN | 48.1 | 62.4 | 58.8 |

Table 4: Impact of Instance Segmentation Modules. With weaker modules, we still get better performance than SOTAs.

Figure 4: Qualitative results of text-based video segmentation. We show three language queries and draw the corresponding segmentation results using the same color as the query. As queries 1 and 3 refer to a single object, the leftmost object in the first row of (c) is covered with both red (query 1) and green (query 3). (a) Original frames. (b) Results of the bottom-up method [39]. (c) Results of our basic top-down pipeline (row 1 in Table 3). (d) Results of our full model. (e) Ground truth. (f) Visualization of attention weights in our TSRM.

Figure 5: Visualization results of a complex video. An object is covered with different colors if it is referred to by more than one query, e.g., the leftmost object in the first row of (c). (a) Original video frame. (b) Results of the bottom-up method [39]. (c) Results of our top-down pipeline. (d) Results of the PRM-enhanced top-down model (row 4 in Table 3). (e) Results of our full ClawCraneNet.

4.4 Ablation Studies

Effectiveness of Top-Down Pipeline. We first investigate the effectiveness of our Top-Down Pipeline. We evaluate the basic top-down pipeline (segment-embed-retrieve pipeline), which ignores all relation information among candidate objects and removes all relation-based modules. The results are reported in Table 3. Compared to the state-of-the-art bottom-up methods shown in Table 1, our top-down pipeline achieves significantly better performance, especially at high precision thresholds.

Impact of Instance Segmentation Modules. In our experiments, for a better trade-off between time cost and performance, we employ a one-stage segmentation method [37]. The performance of our ClawCraneNet with different instance segmentation methods is shown in Table 4. We use CondInst (R-101-FPN) as the off-the-shelf instance segmentation module by default. With weaker instance segmentation models, we still show competitive performance compared with bottom-up approaches.

Impact of Positional Relation Module. As shown in Table 3, adding different positional encoding methods yields significant improvements over the basic top-down pipeline. We conclude that position information is very useful for the top-down framework in the text-based video segmentation task. In addition, our full Positional Relation Module (PRM) achieves better performance than the absolute and relative position encoding methods. The reason is that absolute position encoding lacks sensitivity to relative descriptions like “the second from left”, while it is hard for exclusively relative encoding to balance the weights between absolute and relative information.

Impact of Text-guided Semantic Relation Module. We further evaluate the effectiveness of the proposed TSRM. As shown in Table 3 (rows 5-6), the vanilla object-level self-attention does not benefit the performance, but a clear improvement occurs when text guidance is introduced to formulate object-level relations. A possible reason is that the plain visual-based relation module cannot correctly gather relational information by just measuring context similarities. With more linguistic information included, concrete relations can be formulated, leading to a positive impact on the performance.

Impact of Temporal Relation Module. By fully utilizing temporal coherence, our ClawCraneNet with the temporal relation module (the last row in Table 3) further improves performance and outperforms all the other models, which validates the design. These results again confirm the merits of the object-level relation formulation.

4.5 Qualitative Analysis

We investigate the internal mechanism of ClawCraneNet by analyzing qualitative results. Compared with the bottom-up method ((b) of Figure 4 and Figure 5), our top-down pipeline produces reasonable segmentation results, while the bottom-up method confuses the relational information and produces ambiguous foreground masks. As shown in Figure 4 (c) and (d), when the text-guided semantic relation module is introduced, our network learns to capture mutual information of corresponding objects. Visualization examples of text-guided attention weights are shown in Figure 4 (f). The comparisons between the complete ClawCraneNet and alternative structures are illustrated in Figure 5. Only some of the objects are labeled with language descriptions in A2D Sentences, and we show the unmentioned objects in Figure 5 with purple masks. Without the relative position module, the model tends to focus on objects that match the absolute position description “left”, resulting in a wrong prediction. Our temporal relation module helps to achieve temporal consistency among frames and corrects the misunderstanding of the previous module (Figure 5 (e)). In conclusion, these visualized results show the effectiveness of our top-down design and the object-relational modules in ClawCraneNet.

5 Conclusion

In this paper, we propose ClawCraneNet, which follows the segment-comprehend-retrieve strategy and is the first attempt to introduce object-level relations into the text-based video segmentation field. Different from previous bottom-up methods, our ClawCraneNet maximizes the semantic information flow between object-level features by fully investigating the relationships between intra-frame and inter-frame objects, i.e., positional relation, text-guided semantic relation, and inter-frame temporal relation. Evaluations on commonly used benchmark datasets demonstrate that ClawCraneNet surpasses the state-of-the-art methods by large margins.

Appendix A Appendix

A.1 Analysis of Puppet Mask in J-HMDB

In this section, we give a brief explanation of the poor P@0.9 performance on the J-HMDB Sentences [20] dataset (the last row in Table 2). As shown in Figure 6, ground-truth masks in J-HMDB Sentences [20] are generated from puppets, which leads to inconsistency between the segmentation mask and the actual object. Since the evaluated model is trained on precise segmentation masks from A2D Sentences [45], it is hard for ClawCraneNet to fit the data distribution of J-HMDB without fine-tuning.

Figure 6: Qualitative comparison between the predicted mask by ClawCraneNet and the ground-truth puppet mask.

A.2 Analysis of Object Embedding

We visualize the predicted embeddings of all candidate objects for 12 language queries on a randomly selected video in the A2D Sentences [45] validation set, using t-SNE [31] to project the visual object embeddings (256-dim) into a 2D space. As shown in Figure 7, embeddings belonging to different objects have clearly distinguishable margins, confirming that ClawCraneNet learns discriminative object embeddings, which is the prerequisite of the temporal relation module.

Figure 7: t-SNE visualizations of learned object embeddings from ClawCraneNet.

A.3 More Details about Predicted Results

In Figure 8, we detail the predictions of our ClawCraneNet. In each sub-figure, we give the original RGB frame and language query as input, then plot the predicted results of our ClawCraneNet and ACGA [39]. “All candidates” refers to the candidate objects perceived by the instance segmentation module.

Thanks to the object-level comprehension of ClawCraneNet, reasonable results with clear boundaries are generated. Specifically, compared to bottom-up methods, which mainly focus on salient objects, our network can perceive inconspicuous objects in their entirety and distinguish them using semantic context. Some failure cases are shown in Figure 9. It is still hard for ClawCraneNet to handle some visual ambiguity, e.g., objects moving at high speed (Figure 9 (a)) or objects in the mirror (Figure 9 (c)).

An interesting observation is that with an ambiguous language description like “Man in red running”, multiple fitting objects can be highlighted with higher similarity scores, as shown in Figure 9 (b). To give more examples, we provide some results in the form of videos in the supplementary material.

Figure 8: Qualitative results of text-based video segmentation.
Figure 9: Failure cases of text-based video segmentation.

References

  • [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.
  • [2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, pages 3674–3683, 2018.
  • [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, pages 2425–2433, 2015.
  • [4] Mohit Bajaj, Lanjun Wang, and Leonid Sigal. G3raphground: Graph-based language grounding. In ICCV, pages 4281–4290, 2019.
  • [5] Guy Thomas Buswell. How people look at pictures: a study of the psychology and perception in art. 1935.
  • [6] Ding-Jie Chen, Songhao Jia, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu. See-through-text grouping for referring image segmentation. In ICCV, pages 7454–7463, 2019.
  • [7] Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin, and Ming-Hsuan Yang. Referring expression object segmentation with caption-aware consistency. In BMVC, page 263, 2019.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. Ieee, 2009.
  • [9] Lianli Gao, Zhao Guo, Hanwang Zhang, Xing Xu, and Heng Tao Shen. Video captioning with attention-based lstm and semantic consistency. TMM, 19(9):2045–2055, 2017.
  • [10] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In CVPR, pages 5958–5966, 2018.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [13] John M Henderson and Taylor R Hayes. Meaning-based guidance of attention in scenes as revealed by meaning maps. Nature Human Behaviour, 1(10):743–747, 2017.
  • [14] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108–124, 2016.
  • [15] Zhiwei Hu, Guang Feng, Jiayu Sun, Lihe Zhang, and Huchuan Lu. Bi-directional relationship inferring network for referring image segmentation. In CVPR, pages 4424–4433, 2020.
  • [16] Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, and Bo Li. Referring image segmentation via cross-modal progressive comprehension. In CVPR, pages 10488–10497, 2020.
  • [17] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
  • [18] Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, and Fei Wang. Collaborative spatial-temporal modeling for language-queried video actor segmentation. In CVPR, pages 4187–4196, 2021.
  • [19] Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, and Jizhong Han. Linguistic structure guided context modeling for referring image segmentation. In ECCV, pages 59–75, 2020.
  • [20] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013.
  • [21] David A Kalkstein, Leor M Hackel, and Yaacov Trope. Person-centered cognition: The presence of people in a visual scene promotes relational reasoning. Journal of Experimental Social Psychology, 90:104009, 2020.
  • [22] Yun-Ching Kao, Emily S Davis, and John DE Gabrieli. Neural correlates of actual and predicted memory formation. Nature neuroscience, 8(12):1776–1783, 2005.
  • [23] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
  • [24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [25] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • [26] Zhenyang Li, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. Tracking by natural language specification. In CVPR, pages 6495–6503, 2017.
  • [27] Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, and Yi Yang. Local-global context aware transformer for language-guided video segmentation. IEEE TPAMI, 45(8):10055–10069, 2023.
  • [28] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1271–1280, 2017.
  • [29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  • [30] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, pages 289–297, 2016.
  • [31] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 2008.
  • [32] Edgar Margffoy-Tuay, Juan C Pérez, Emilio Botero, and Pablo Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV, pages 630–645, 2018.
  • [33] Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. Visual-textual capsule routing for text-based video segmentation. In CVPR, pages 9942–9951, 2020.
  • [34] Ke Ning, Lingxi Xie, Fei Wu, and Qi Tian. Polar relative positional encoding for video-language segmentation. In IJCAI, 2020.
  • [35] Shuang Qiu, Yao Zhao, Jianbo Jiao, Yunchao Wei, and Shikui Wei. Referring image segmentation by generative adversarial learning. TMM, 22(5):1333–1344, 2019.
  • [36] Arka Sadhu, Kan Chen, and Ram Nevatia. Video object grounding using semantic roles in language description. In CVPR, pages 10417–10427, 2020.
  • [37] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
  • [38] Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. Context modulated dynamic networks for actor and action video segmentation with language queries. In AAAI, pages 12152–12159, 2020.
  • [39] Hao Wang, Cheng Deng, Junchi Yan, and Dacheng Tao. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In ICCV, pages 3939–3948, 2019.
  • [40] Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. Video captioning via hierarchical reinforcement learning. In CVPR, pages 4213–4222, 2018.
  • [41] Jeremy M Wolfe and Todd S Horowitz. Five factors that guide attention in visual search. Nature Human Behaviour, 1(3):1–8, 2017.
  • [42] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. Phrasecut: Language-based image segmentation in the wild. In CVPR, pages 10216–10225, 2020.
  • [43] Yu Wu, Linchao Zhu, Lu Jiang, and Yi Yang. Decoupled novel object captioner. In ACM MM, 2018.
  • [44] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang. Dual attention matching for audio-visual event localization. In ICCV, 2019.
  • [45] Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J Corso. Can humans fly? action understanding with multiple classes of actors. In CVPR, pages 2264–2273, 2015.
  • [46] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In ECCV, 2020.
  • [47] Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. A fast and accurate one-stage approach to visual grounding. In ICCV, pages 4683–4693, 2019.
  • [48] Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, and Jiebo Luo. Grounding-tracking-integration. TCSVT, 2020.
  • [49] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In CVPR, pages 10502–10511, 2019.
  • [50] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, pages 4651–4659, 2016.
  • [51] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, pages 1307–1315, 2018.
  • [52] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In CVPR, pages 10668–10677, 2020.
  • [53] Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, and Xiaodan Liang. Vision-dialog navigation by exploring cross-modal memory. In CVPR, pages 10730–10739, 2020.