ClawCraneNet: Leveraging Object-level Relation for
Text-based Video Segmentation

Chen Liang Yu Wu Yawei Luo Yi Yang
Zhejiang University Baidu Research ReLER, University of Technology Sydney

Abstract

†Extended version published in [27].

Text-based video segmentation is a challenging task that segments out the objects referred to by natural language in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach that imitates how humans segment an object with language guidance. We first figure out all candidate objects in videos and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin.

1 Introduction

With the significant progress achieved in computer vision and natural language processing, novel tasks requiring a joint understanding of both visual and linguistic modalities have recently emerged, e.g., visual question answering [1, 30, 3], image [50, 1, 43] and video captioning [9, 40], vision-dialog navigation [2, 53], and so on. Inspired by such success, Gavrilyuk et al. [10] introduce the challenging task of text-based video segmentation, which takes a video and a natural language description as inputs and produces a set of segmentation masks for the referent. Solving this task requires a comprehensive understanding of the visual and linguistic information together with a fine-grained analysis of video contents.

Figure 1: (a) Previous bottom-up methods mainly perform semantic relationship formulation at the pixel level. Such models cannot correctly identify high-level relations based merely on local receptive fields, which directly leads to ambiguous predictions. (b) Our top-down pipeline first extracts features for objects and then models the crucial relation information at the object level, leading to better segmentation masks through multi-modal retrieval. We liken the process of retrieving a visual object with a linguistic query to playing a claw crane machine.

Accordingly, several recent works [39, 33, 38, 34] capitalize on pixel-level vision-language relationships for video comprehension. These techniques follow a bottom-up paradigm, which mainly focuses on multi-modal feature fusion and pixel-level relation establishment to generate the segmentation masks directly. While effective to some extent, these attempts generally lack awareness of object-level information and relations, which is crucial for understanding semantic information, especially for multi-modal comprehension [15, 16]. As illustrated in Figure 1 (a), methods with only low-level perception can formulate just local, region-level relationships, which is insufficient for modeling semantic relationships that follow the description logic of natural language. Thus the bottom-up approaches inevitably introduce noisy relationship modeling and inaccurate object comprehension, leading to ambiguous segmentation results.

To tackle the above issue, we draw partial inspiration from the human visual cognitive system, in which humans preferentially direct attention towards meaningful entities (object-oriented) [13, 5, 41] and then extract structured information on how entities relate to or interact with each other (relation-based) [21, 41, 22]. We propose a top-down pipeline to mimic how humans localize the referent with language guidance, i.e., first finding candidates and then parsing relations. Specifically, by applying an off-the-shelf instance segmentation module to find candidate objects, we can tackle the text-based video segmentation problem in an object-level cross-modal retrieval manner. To the best of our knowledge, this is the first attempt to tackle the text-based video segmentation problem from the top-down view.

We further explore three kinds of object-level relations in our top-down pipeline, i.e., positional relation, semantic relation, and temporal relation.

Firstly, we propose a relative position encoding module to encode spatial information of each candidate object. On the basis of absolute position, we further consider the relative ranked index for each object according to its coordinates. The relative index addresses the spatial relations that are common in natural language descriptions, e.g., “the second guy from the left”.

Secondly, we propose a language-guided object attention module to construct semantic relations, which directly highlights the referring entities. As shown in Figure 1, related objects (e.g., “guy” and “ball”) will exchange information according to responsiveness to relational expressions (e.g., “kicking”). After relation-aware language comprehension, a particular referent feature (e.g., “guy”) would contain rich semantic evidence including relationship (e.g., “kicking a ball”) and attribute (e.g., “in black”).

Thirdly, we investigate the temporal relationship between inter-frame objects with a merge-by-track paradigm. Particularly, with a multi-object tracking strategy, we perform inter-frame object association based on similarities, and build temporally related tracks with the Hungarian algorithm [25]. The final prediction is made according to the average of the confidence scores in each track.

In this way, we obtain a visual embedding with rich individual and mutual information, which facilitates a correct language-to-vision correspondence in the complex video context. Figuratively, these visual objects are like well-packaged dolls displayed in a claw crane machine, and the linguistic description acts as a claw looking for the shiniest doll. Our network explicitly formulates this retrieval pipeline, which is why we name it ClawCraneNet. The main contributions are as follows:

• We propose a novel top-down pipeline that tackles the text-based video segmentation task in a retrieval manner.

• We explicitly investigate three kinds of object-level relations to progressively construct discriminative visual embeddings, i.e., relative positional relation, cross-object semantic relation, and inter-frame temporal relation.

• The proposed method significantly outperforms state-of-the-art methods on two popular text-guided video segmentation datasets, i.e., A2D Sentences and J-HMDB Sentences.

2 Related Work

2.1 Referring Image Segmentation

Referring expression segmentation aims at precisely localizing the entity referred to by a natural language expression with a pixel-level segmentation mask. The bottom-up methods [14, 28, 49, 35, 7, 32, 16, 19, 6, 47] mainly construct a multi-modal feature and then generate referring masks after some refinement process. Most state-of-the-art works adopt the structure of a fully convolutional network (FCN) [29] to generate the pixel-level segmentation mask. At first, Hu et al. [14] directly leverage the concatenation of visual and linguistic features from a CNN and an LSTM to construct the multi-modal feature and generate the final mask. Later, several techniques are incorporated into this field, e.g., multi-modal LSTM [28], image-to-word attention [49], dynamic filters [32], and adversarial learning [35] or cycle-consistency [7] between the referring expression and its reconstructed caption. Recently, to explore the relationship between multi-modal features [44] and further model the structural context, Hu et al. [14] propose a bi-directional cross-modal attention module to emphasize visual guidance on linguistic features. Huang et al. [16] utilize a graph-based structure to progressively exploit different types of words in the expression. Different from these works, which focus on low-level feature comprehension, we are inspired by the human vision system, i.e., first finding the candidate objects and then parsing their relations, and investigate object-level feature retrieval as an alternative.

2.2 Top-down Text-based Object Grounding

The existing top-down methods [52, 36, 48, 51, 42, 4] mainly leverage a pre-trained detector, e.g., Mask R-CNN [11], to generate object proposals, and then rank the box-level objects according to the similarity score between vision-language embeddings. We follow the same top-down strategy and utilize an off-the-shelf instance segmentation method to perceive candidate objects. However, existing methods mainly treat box-level object feature matching as the main task and consider the segmentation mask a by-product of the modular comprehension procedure, obtained by simply replacing the output heads. With object features constructed on bounding boxes, these methods cannot handle occlusions among objects in a video, especially in crowded scenes. In this work, we instead model the object feature directly on fine-grained segmentation masks to learn more discriminative object features. Additionally, we further explore multi-modal relationship modeling among high-level visual object features.

2.3 Text-based Video Segmentation

Certain success has been achieved in referring image segmentation. Beyond the image domain, the temporal coherence of referred video objects remains to be explored. Recently, Gavrilyuk et al. [10] extend the Actor-Action Dataset (A2D) with human-annotated sentences and introduce the challenging task of actor and action video segmentation from referring expressions. They adopt language-guided dynamic convolution filters to fuse the multi-modal feature. Since then, bottom-up methods have sprung up. Wang et al. [39] utilize asymmetric attention mechanisms to facilitate visually guided linguistic feature learning. Later, they [38] extend vanilla dynamic convolution with a context-modulated dynamic convolution kernel. Ning et al. [34] convert spatial relations into terms of direction and range for better linguistic spatial formulation. McIntosh et al. [33] introduce a capsule-based approach for better capturing the relationship between multi-modal features. To further mine continuous temporal information, they extend the A2D dataset with annotations for all frames. Hui et al. [18] introduce an additional 2D spatial encoder to alleviate the intrinsic spatial misalignment problem in 3D CNNs.

In this work, with the concrete objects obtained with the instance segmentation module, we explicitly exploit temporal coherence among all frames in a video including annotated key-frames and unlabeled frames.

3 Methods

3.1 Top-Down Pipeline

Previous work [39, 33, 38, 34] tackles the text-based video segmentation task in a bottom-up, pixel-level manner. In contrast, we view it as a cross-modal retrieval problem and decompose the task into two stages. The first stage finds all potential candidate objects and their masks in the video, and the second propagates information among objects and selects the best-matched candidate given the referring sentence.

Suppose a video $V$ has $M$ frames, where each frame $f_{i}$ contains $k_{i}$ candidate objects $\{x_{i}^{1},x_{i}^{2},...,x_{i}^{k_{i}}\}$ . Our target is to retrieve a set of referred objects $T^{*}=\{x_{1}^{*},x_{2}^{*},...,x_{M}^{*}\}$ in the video given a natural language referring sentence $L$ . The sentence $L$ is composed of a sequence of words $(w_{1},w_{2},...,w_{N_{l}})$ , where $N_{l}$ is the length of the input sentence. In other words, the task is to use the language description $L$ to retrieve the target object track $T^{*}$ from the video.

Linguistic Embedding Construction. The language expression $L$ is first processed via a bi-LSTM [17], where the hidden states $\{h_{1},h_{2},\cdots h_{N_{l}}\}$ are further encoded by a self-guided attention module. In particular, the linguistic embedding $\mathcal{L}$ can be obtained by:

$$\mathcal{L}=\texttt{MLP}\Big(\sum_{i=1}^{N_{l}}{\alpha_{i}h_{i}}\Big), \tag{1}$$

where $\alpha$ denotes the word-level attention weights, calculated by $\alpha_{i}=\texttt{softmax}(\texttt{fc}(h_{i}))$ , and MLP denotes the multi-layer perceptron. The self-guided attention module gives the language encoder a flexible way to focus on keywords and reduces the negative impact caused by sentence truncation or padding. In the following, we illustrate our top-down pipeline for this task.
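The attention pooling in Eq. 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the names `attention_pool`, `fc_weight`, and `fc_bias` are hypothetical, the scoring layer fc is assumed to be a single linear layer, the softmax is assumed to run over the words of the sentence, and the final MLP is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, fc_weight, fc_bias=0.0):
    """Weighted sum of word-level hidden states (the inner sum of Eq. 1).

    hidden_states: list of N_l vectors (lists of floats), one per word.
    fc_weight:     hypothetical weights of the scoring layer fc, one per channel.
    Returns (alpha, pooled): attention weights and sum_i alpha_i * h_i.
    """
    scores = [sum(w * x for w, x in zip(fc_weight, h)) + fc_bias
              for h in hidden_states]
    alpha = softmax(scores)  # normalized over the words of the sentence
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden_states))
              for d in range(dim)]
    return alpha, pooled

# Three "words" with 2-d hidden states; the second word gets the highest score.
h = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
alpha, pooled = attention_pool(h, fc_weight=[1.0, 0.0])
```

Because the weights are normalized over words, a padded or uninformative token with a low score contributes little to the pooled sentence feature.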

Mask Out Foreground Objects. To localize objects in videos, we first build an instance segmentation model by considering the visual content only. Specifically, we use CondInst [37] as our backbone, and train the model using all the object masks. Then we apply the instance segmentation model on each frame and detect and segment all the foreground objects as candidates. Note that we do not exploit additional data/annotations via the instance segmentation model. Denote the object segmentation masks in frame $f_{i}$ as $\{o^{j}\}_{j=1}^{N_{v}}$ , where $N_{v}$ is the number of candidates.

Individual Object Features. We then obtain the $j$ -th individual object feature $v^{j}$ by max-pooling over the mask-cropped feature map extracted from the visual CNN model. Formally,

$$v^{j}=\texttt{MLP}(\texttt{Max}(F_{v}\odot o^{j})), \tag{2}$$

where $\odot$ is element-wise multiplication, Max stands for global max pooling, and $F_{v}$ denotes the feature map of the entire frame. During training, the CNN model is updated in an end-to-end manner. Via Eq. 2, we build the individual object feature $v^{j}$ by applying its instance segmentation mask $o^{j}$ on top of the CNN feature map.
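As a rough illustration of the masked pooling inside Eq. 2, the sketch below multiplies each channel of a feature map by a binary mask and takes the global maximum. The function name `masked_max_pool` and the nested-list layout (channels, height, width) are assumptions made for this example, the mask is assumed to be already resized to the feature-map resolution, and the trailing MLP is omitted.

```python
def masked_max_pool(feature_map, mask):
    """Global max pooling over a mask-cropped feature map.

    feature_map: C x H x W nested lists of floats (F_v).
    mask:        H x W nested lists of 0/1 (o^j).
    Returns a C-dimensional list: per-channel max of F_v * o^j.
    """
    height, width = len(mask), len(mask[0])
    pooled = []
    for channel in feature_map:
        pooled.append(max(
            channel[y][x] * mask[y][x]
            for y in range(height)
            for x in range(width)
        ))
    return pooled

feature_map = [
    [[1.0, 5.0], [2.0, 3.0]],     # channel 0
    [[-1.0, 4.0], [-2.0, -3.0]],  # channel 1
]
mask = [[1, 0], [1, 0]]           # object covers the left column only
feature = masked_max_pool(feature_map, mask)  # -> [2.0, 0.0]
```

Note that with element-wise multiplication the masked-out positions become zero, so a channel whose activations are all negative inside the mask pools to zero. This is harmless after a ReLU, where activations are non-negative.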

Find Visual-Linguistic Match. From the top-down perspective, the final target is to select the best match among all candidate objects given the language input. Thus we train our model to maximize the matching score between the referred object track and the language representation by,

$$T^{*}=\operatorname*{arg\,max}\sum_{i=1}^{M}{S(v_{i}^{*},\mathcal{L})}. \tag{3}$$

Therefore, the core of the problem is to learn a proper visual object embedding that distinguishes the target objects from others. In the later sections, details about how to learn discriminative multi-modal embeddings are introduced.

3.2 Object-level Visual Embedding Construction

As many entities exist in a visual scene, semantic information describing entity categories alone is not enough to distinguish them. Therefore, it is natural to propagate information among all the candidates to facilitate the retrieval process. In this section, on the basis of the individual object representation, we further leverage three kinds of object-level relations, i.e., positional relation, text-guided semantic relation, and inter-frame temporal relation.

Positional Relation Module. Positional information is crucial in depicting an object in images/videos. We design a Positional Relation Module (PRM) to encode the object-level spatial information with $p_{i}=(x_{min}^{i},y_{min}^{i},x_{max}^{i},y_{max}^{i},x_{c}^{i},y_{c}^{i},w_{i},h_{i},r_{i}^{x},r_{i}^{y})$ , where $(x_{min}^{i},y_{min}^{i})$ , $(x_{max}^{i},y_{max}^{i})$ , $(x_{c}^{i},y_{c}^{i})$ , $w_{i}$ and $h_{i}$ are the normalized top-left coordinates, bottom-right coordinates, center coordinates, width, and height of the smallest circumscribed box of segment $o_{i}$ , respectively. The last two dimensions $r_{i}^{x}$ and $r_{i}^{y}$ are the normalized relative position indices along the x-axis and y-axis. The spatially enhanced object feature $\mathcal{V}_{i}$ is then calculated as:

$$\mathcal{V}_{i}=v_{i}+W_{p}(p_{i}), \tag{4}$$

where $W_{p}$ is a learnable matrix. By explicitly modeling relative position information, our network gains the ability to handle referring expressions like “the second from the left”, which are hard for low-level networks to infer with pixel-level spatial encoding only. We show by experiments that even with such slight guidance, our model does learn to comprehend relative spatial descriptions. During training, to make the model position-aware, we randomly flip the frame horizontally and swap the corresponding directional words in the description (e.g., changing “right” to “left”).
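To make the 10-dimensional vector $p_{i}$ concrete, the sketch below builds it from pixel-space boxes. Several details are assumptions, since the text does not specify them: objects are ranked by their box centers, ranks start at 0 for the left-most (top-most) object, and ranks are normalized by the number of objects minus one. The function name `positional_vectors` is hypothetical, and the learnable projection $W_{p}$ is not shown.

```python
def positional_vectors(boxes, frame_w, frame_h):
    """Build p_i for every candidate object in one frame.

    boxes: list of (x_min, y_min, x_max, y_max) in pixels.
    Returns one 10-tuple per box:
    (x_min, y_min, x_max, y_max, x_c, y_c, w, h, r_x, r_y), all normalized.
    """
    n = len(boxes)
    centers = [((x0 + x1) / 2.0, (y0 + y1) / 2.0) for x0, y0, x1, y1 in boxes]
    # Rank of each object along each axis (assumption: ranked by box center).
    order_x = sorted(range(n), key=lambda i: centers[i][0])
    order_y = sorted(range(n), key=lambda i: centers[i][1])
    rank_x = {obj: r for r, obj in enumerate(order_x)}
    rank_y = {obj: r for r, obj in enumerate(order_y)}
    denom = max(n - 1, 1)  # avoid division by zero for a single object
    vectors = []
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        cx, cy = centers[i]
        vectors.append((
            x0 / frame_w, y0 / frame_h, x1 / frame_w, y1 / frame_h,
            cx / frame_w, cy / frame_h,
            (x1 - x0) / frame_w, (y1 - y0) / frame_h,
            rank_x[i] / denom, rank_y[i] / denom,
        ))
    return vectors

# Three people standing side by side in a 320 x 320 frame.
boxes = [(200, 40, 260, 200), (20, 60, 80, 220), (110, 50, 170, 210)]
p = positional_vectors(boxes, frame_w=320, frame_h=320)
# p[1][8] == 0.0 (left-most), p[2][8] == 0.5 (second from the left), p[0][8] == 1.0
```

The rank entries change only when the ordering of objects changes, not when every object shifts by a few pixels, which is what an expression such as “the second from the left” depends on.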

Text-guided Semantic Relation Module. To formulate intra-frame object-level relationships, a simple idea is to employ the commonly used vanilla attention module as a relation formulator. However, vanilla attention exchanges information naively based on feature similarity, which makes relational cues hard to distinguish from mere within-modal similarity. To narrow this gap, we introduce text guidance into the semantic relation module. As in the referring example illustrated in Figure 1, compared with directly forming the relationship between “Guy” and “Ball”, it is easier for the network to infer the “Ball” if it already knows that the “Guy” is “Kicking” something. Based on this motivation, we devise a Text-guided Semantic Relation Module (TSRM) to leverage the relational expressions in the language description. As illustrated in Figure 3, TSRM takes the concatenated object features $f_{V}=(\mathcal{V}_{1},\mathcal{V}_{2},...,\mathcal{V}_{N_{v}})$ and the sentence $L$ as inputs. TSRM first learns self-guided weights to form a relationship-aware linguistic feature $f_{t}$ , which is then fused with the object features into a multi-modal feature $f_{o}$ by,

$$f_{o}=W_{o}(\texttt{Concat}(f_{V},f_{t})), \tag{5}$$

where $\texttt{Concat}(\cdot,\cdot)$ represents the concatenation operation along the channel axis and $W_{o}$ is a learnable matrix. Next, in the relation stage, the original visual feature is used to query the multi-modal feature. Given the query $f_{q}=W_{q}f_{V}$ , the key $f_{k}=W_{k}f_{o}$ , and the value $f_{v}=W_{v}f_{o}$ , the process of obtaining the text-guided object features $F_{o}=(\mathcal{V}_{1}^{r},\mathcal{V}_{2}^{r},...,\mathcal{V}_{N_{v}}^{r})$ can be formulated as,

$$F_{o}=f_{V}+\texttt{softmax}\Big(\frac{f_{q}f_{k}^{T}}{\sqrt{d_{k}}}\Big)f_{v}, \tag{6}$$

where $\mathcal{V}_{i}^{r}$ is the relation-enhanced object feature and $d_{k}$ is the channel dimension of $f_{k}$ . During this procedure, each object gains query-focused global context information, especially when there is a strong response between the visual object and the linguistic description.
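Eqs. 5 and 6 can be traced with small matrices in plain Python. The sketch below is illustrative only and rests on several assumptions: features are stored as row vectors (so projections are applied as `f @ W` rather than `W f`), the text feature $f_{t}$ is appended to every object feature before the projection $W_{o}$, and all weight matrices are passed in explicitly. The helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(m):
    """Row-wise numerically stable softmax."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def text_guided_attention(f_V, f_t, W_o, W_q, W_k, W_v):
    """Eq. 5 followed by Eq. 6, using a row-vector convention.

    f_V: N_v x d object features; f_t: d_t-dim text feature (list).
    W_o: (d + d_t) x d; W_q, W_k, W_v: d x d.
    Returns (F_o, attn) with F_o of shape N_v x d and attn of shape N_v x N_v.
    """
    fused_in = [row + f_t for row in f_V]  # Concat along the channel axis
    f_o = matmul(fused_in, W_o)            # Eq. 5: multi-modal feature
    f_q = matmul(f_V, W_q)                 # queries come from visual features
    f_k = matmul(f_o, W_k)
    f_v = matmul(f_o, W_v)
    d_k = len(f_k[0])
    logits = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k)
               for kr in f_k] for qr in f_q]
    attn = softmax_rows(logits)
    context = matmul(attn, f_v)
    F_o = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(f_V, context)]
    return F_o, attn

f_V = [[1.0, 0.0], [0.0, 1.0]]  # two objects, d = 2
f_t = [0.5]                     # toy 1-d text feature
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
F_o, attn = text_guided_attention(f_V, f_t, W_o, I2, I2, I2)
```

The residual term $f_{V}$ means that an object whose attention context carries no useful signal keeps its original feature largely unchanged.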

Temporal Relation Module. To deal with blurry or complicated scenes in a video, a natural idea is to borrow the confident judgment from a clear scene for tackling hard scenes. We employ a tracking-based strategy for this purpose. In particular, cross-frame objects are associated based on visual similarities to form a track. Given any two objects ( $x_{i}$ , $x_{j}$ ) from adjacent frames, the association similarity $\mathcal{S}_{s}$ is formulated as follows:

$$\mathcal{S}_{s}(x_{i},x_{j})=\mathcal{S}_{c}(\mathcal{V}_{i}^{r},\mathcal{V}_{j}^{r})+\alpha*U(x_{i},x_{j}), \tag{7}$$

where $\mathcal{S}_{c}$ denotes the cosine similarity, $U$ represents the mask IoU, and $\alpha$ is a weighting coefficient. Following the multi-object tracking strategy in [46], we maintain several active tracks, initialized from the first frame by treating each candidate object as an exclusive track. For each frame, we compute the similarity between all active tracks and all candidate object embeddings in the current frame according to Eq. 7. An association is allowed only when the similarity is greater than a threshold $\gamma$ , and an active track is ended if it is not updated for $\beta$ matching rounds. The Hungarian algorithm [25] is applied to perform multi-object matching. Afterwards, unassigned segments start new tracks, and the object association is repeated until the final frame. Since the target object may appear or disappear in intermediate frames, it is reasonable to allow intermittent tracks.
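One association round can be sketched as below. This is not the authors' tracker: for brevity the Hungarian algorithm is replaced by a brute-force search over one-to-one assignments, which returns the same optimum but is practical only for a handful of objects. The value `alpha=0.5`, the dictionary layout with `"emb"` and `"mask"` keys, and the choice to apply the threshold $\gamma$ after matching are all assumptions made for illustration.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mask_iou(m1, m2):
    """IoU of two masks given as sets of (y, x) pixel coordinates."""
    union = len(m1 | m2)
    return len(m1 & m2) / union if union else 0.0

def association_similarity(a, b, alpha):
    """Eq. 7: cosine similarity of embeddings plus alpha times mask IoU."""
    return cosine(a["emb"], b["emb"]) + alpha * mask_iou(a["mask"], b["mask"])

def associate(tracks, candidates, alpha, gamma):
    """Return {track_index: candidate_index} for one frame."""
    sim = [[association_similarity(t, c, alpha) for c in candidates]
           for t in tracks]
    n_t, n_c = len(tracks), len(candidates)
    if n_t <= n_c:
        options = (list(enumerate(p)) for p in permutations(range(n_c), n_t))
    else:
        options = ([(t, c) for c, t in enumerate(p)]
                   for p in permutations(range(n_t), n_c))
    best_pairs, best_total = [], -math.inf
    for pairs in options:  # brute-force stand-in for the Hungarian algorithm
        total = sum(sim[t][c] for t, c in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Keep only confident associations (assumption: thresholded after matching).
    return {t: c for t, c in best_pairs if sim[t][c] > gamma}

tracks = [
    {"emb": [1.0, 0.0], "mask": {(0, 0), (0, 1)}},
    {"emb": [0.0, 1.0], "mask": {(5, 5), (5, 6)}},
]
candidates = [
    {"emb": [0.1, 1.0], "mask": {(5, 5), (5, 6), (5, 7)}},
    {"emb": [1.0, 0.1], "mask": {(0, 1), (0, 2)}},
    {"emb": [-1.0, -1.0], "mask": {(9, 9)}},
]
matches = associate(tracks, candidates, alpha=0.5, gamma=0.8)
# Candidates that were not assigned would start new tracks.
new_tracks = [c for c in range(len(candidates)) if c not in matches.values()]
```

In this toy frame the first track continues with candidate 1 and the second with candidate 0, while candidate 2 matches neither track and would open a new one.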

Methods	P@0.5	P@0.6	P@0.7	P@0.8	P@0.9	mAP 0.5:0.95	IoU Overall	IoU Mean	FPS
Hu et al. [14] ECCV16	34.8	23.6	13.3	3.3	0.1	13.2	47.4	35.0	-
Li et al. [26] CVPR17	38.7	29.0	17.5	6.6	0.1	16.3	51.5	35.4	-
Gavrilyuk et al. [10] CVPR18	53.8	43.7	31.8	17.1	2.1	26.9	57.4	48.1	-
Wang et al. [39] ICCV19	55.7	45.9	31.9	16.0	2.0	27.4	60.1	49.0	8.64
McIntosh et al. [33] CVPR20	52.6	45.0	34.5	20.7	3.6	30.3	56.8	46.0	-
Wang et al. [38] AAAI20	60.7	52.5	40.5	23.5	4.5	33.3	62.3	53.1	7.18
Ning et al. [34] IJCAI20	63.4	57.9	48.3	32.2	8.3	38.8	66.1	52.9	5.42
Ours	70.4	67.7	61.7	48.9	17.1	49.4	63.1	59.9	9.27

Table 1: Comparison with state-of-the-art methods on the A2D Sentences using IoU and Precision@K as metrics.

Methods	P@0.5	P@0.6	P@0.7	P@0.8	P@0.9	mAP 0.5:0.95	IoU Overall	IoU Mean
Hu et al. [14] ECCV16	63.3	35.0	8.5	0.2	0.0	17.8	54.6	52.8
Li et al. [26] CVPR17	57.8	33.5	10.3	0.6	0.0	17.3	52.9	49.1
Gavrilyuk et al. [10] CVPR18	71.2	51.8	26.4	3.0	0.0	26.7	55.5	57.0
Wang et al. [39] ICCV19	75.6	56.4	28.7	3.4	0.0	28.9	57.6	58.4
McIntosh et al. [33] CVPR20	67.7	51.3	28.3	5.1	0.0	26.1	53.5	55.0
Wang et al. [38] AAAI20	74.2	58.7	31.6	4.7	0.0	30.1	55.4	57.6
Ning et al. [34] IJCAI20	69.1	57.2	31.9	6.0	0.1	29.4	-	-
Ours	88.0	79.6	56.6	14.7	0.2	43.3	64.4	65.5

Table 2: Comparison with state-of-the-art methods on J-HMDB Sentences, using the best model trained on A2D Sentences without fine-tuning.

3.3 Training and Inference

Once the comprehension procedure is complete, the final set of visual embeddings $\{\mathcal{V}_{1}^{r},\mathcal{V}_{2}^{r},\cdots,\mathcal{V}_{N_{v}}^{r}\}$ and the linguistic embedding $\mathcal{L}$ are obtained. We calculate the cosine similarity between the linguistic embedding and each visual candidate, ${\mathcal{V}_{i}^{r}}^{\texttt{T}}\mathcal{L}$ . Here we enforce all vectors to be L2-normalized feature embeddings, i.e., $||\mathcal{V}_{i}^{r}||=1$ , $||\mathcal{L}||=1$ . We adopt the contrastive learning loss for optimizing the model,

$$s_{i}=\frac{\exp({\mathcal{V}_{i}^{r}}^{\texttt{T}}\mathcal{L}/\tau)}{\sum_{j=1}^{N_{v}}\exp({\mathcal{V}_{j}^{r}}^{\texttt{T}}\mathcal{L}/\tau)}, \tag{8}$$
$$\text{loss}=-\log(s_{gt}), \tag{9}$$

where $s_{gt}$ is the matching score of the ground-truth object, $\tau$ is a temperature parameter that controls the concentration level of the distribution. Higher $\tau$ leads to a softer probability distribution. We set $\tau=0.1$ in our experiments.
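The loss in Eqs. 8 and 9 is a softmax cross-entropy over the candidates of a frame, with temperature-scaled cosine similarities as logits. The following plain-Python sketch shows the computation for one sentence; the function names and the toy embeddings are illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def matching_scores(visual, text, tau=0.1):
    """Eq. 8: softmax over temperature-scaled cosine similarities."""
    text = l2_normalize(text)
    logits = [sum(a * b for a, b in zip(l2_normalize(v), text)) / tau
              for v in visual]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def contrastive_loss(visual, text, gt_index, tau=0.1):
    """Eq. 9: negative log matching score of the ground-truth object."""
    return -math.log(matching_scores(visual, text, tau)[gt_index])

visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three candidate objects
text = [1.0, 0.2]                              # sentence embedding
scores = matching_scores(visual, text)
loss_sharp = contrastive_loss(visual, text, gt_index=0, tau=0.1)
loss_soft = contrastive_loss(visual, text, gt_index=0, tau=1.0)
```

With the same embeddings, `tau=1.0` spreads the probability mass more evenly over the candidates than `tau=0.1`, so the loss for a correctly ranked ground-truth object is larger under the softer distribution.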

During the inference phase, our network first extracts multi-modal embeddings for each frame. Then, the Temporal Relation Module is applied to obtain the candidate tracks. The final track is retrieved by choosing the candidate track with the highest matching score.
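A minimal sketch of the final selection step is given below. It assumes, following the description in the introduction, that each track is scored by the average of its per-frame matching scores; the function name `select_track` and the track identifiers are hypothetical.

```python
def select_track(track_scores):
    """Pick the track with the highest average per-frame matching score.

    track_scores: dict mapping a track id to a non-empty list of scores,
                  one per frame in which the track is present.
    """
    return max(track_scores,
               key=lambda tid: sum(track_scores[tid]) / len(track_scores[tid]))

track_scores = {
    "track_a": [0.90, 0.20, 0.30],  # one confident frame, otherwise weak
    "track_b": [0.60, 0.70, 0.65],  # consistently good match
}
best = select_track(track_scores)  # -> "track_b"
```

Averaging lets confident frames support blurry ones within a track, and it keeps a single spurious high score from deciding the result.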

Besides, as a prerequisite for the object association step, visual embeddings belonging to the same object are implicitly pulled together, since they are all expected to be close to the same linguistic embedding. More explanations are provided in the supplementary materials.

Methods	P@0.5	P@0.6	P@0.7	P@0.8	P@0.9	mAP 0.5:0.95	IoU Overall	IoU Mean
Top-Down Pipeline	64.4	62.1	56.9	45.3	16.4	45.6	58.4	55.1
+ Absolute PE	66.8	64.2	58.7	46.9	16.4	47.0	60.6	56.9
+ Relative PE	66.3	63.9	60.2	46.9	16.3	46.9	60.2	56.7
+ PRM	67.2	64.5	58.9	47.2	16.4	47.2	61.1	57.4
+ PRM + Vanilla-attention	67.5	64.6	59.2	47.3	16.6	47.3	61.0	57.6
+ PRM + TSRM	68.6	66.0	60.2	48.1	16.8	48.3	62.3	58.6
+ PRM + TSRM + TRM	70.4	67.7	61.7	48.9	17.1	49.4	63.1	59.9

Table 3: Ablation studies on A2D Sentences. PE indicates Position Encoding. Positional Relation Module, Text-guided Semantic Relation Module, and Temporal Relation Module are abbreviated as “PRM”, “TSRM”, and “TRM”, respectively.

4 Experiments

4.1 Datasets and Evaluation Criteria

We conduct our experiments on two extended datasets: A2D Sentences and J-HMDB Sentences. These datasets are released in [10] by additionally providing corresponding human natural descriptions on original A2D [45] and J-HMDB [20] respectively. A2D Sentences contains 3782 videos in total with 8 action classes performed by 7 actor classes. Each video in A2D has 3 to 5 frames annotated with pixel-level actor-action segmentation masks. Besides, it contains 6,655 sentences corresponding to actors and their actions. Following settings in [39], we split the whole dataset into 3017 training videos, 737 testing videos, and 28 unlabeled videos. J-HMDB Sentences contains 928 short videos with 928 corresponding sentences describing 21 different action classes. Pixel-wise 2D articulated human puppet masks are provided for evaluating segmentation performance.

The proposed method is evaluated with the criteria of Intersection-over-Union (IoU) and precision. The overall IoU computes the ratio of the total intersection area divided by the total union area over testing samples. The mean IoU is the averaged IoU over all samples, which treats samples of different sizes equally. We also measure precision@K which considers the percentage of testing samples whose IoU scores are higher than threshold K at 5 different IoU thresholds and calculate mean average precision over 0.50:0.05:0.95 [10].
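The difference between the two IoU variants is easiest to see on a toy example. The sketch below assumes each test sample is summarized by its intersection and union areas in pixels, and that mAP is the mean of Precision@K over the thresholds 0.50:0.05:0.95; the function names are illustrative.

```python
def overall_iou(samples):
    """Total intersection area divided by total union area."""
    return sum(i for i, _ in samples) / sum(u for _, u in samples)

def mean_iou(samples):
    """Average of per-sample IoU, so every sample counts equally."""
    return sum(i / u for i, u in samples) / len(samples)

def precision_at(samples, k):
    """Fraction of samples whose IoU is higher than threshold k."""
    return sum(1 for i, u in samples if i / u > k) / len(samples)

def mean_average_precision(samples):
    """Mean of Precision@K over K = 0.50, 0.55, ..., 0.95."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return sum(precision_at(samples, k) for k in thresholds) / len(thresholds)

# (intersection, union) in pixels: one large, well-segmented object
# and one small, poorly segmented object.
samples = [(900, 1000), (10, 40)]
overall = overall_iou(samples)          # 910 / 1040 = 0.875
mean = mean_iou(samples)                # (0.9 + 0.25) / 2 = 0.575
p50 = precision_at(samples, 0.5)        # 0.5
m_ap = mean_average_precision(samples)  # 0.4
```

Here the large object dominates overall IoU, whereas mean IoU is pulled down by the failure on the small object.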

4.2 Implementation Details

Our network is built on a one-stage instance segmentation method named CondInst [37] for balanced performance and speed. It could be replaced with any other instance segmentation network. This model is initialized from ResNet101 [12] pre-trained on ImageNet [8] and further trained exclusively on A2D [45]. Note that we do not leverage any additional data/annotations when building the instance segmentation module.

For the visual and linguistic feature extractors, we adopt a ResNet50 [12] model pre-trained on ImageNet [8] as the visual backbone and a bi-LSTM [17] as the text encoder. All input frames are resized to $320\times 320$ . Following the settings in [10], the maximum sentence length is set to 20 and the dimension of the word vector is 1000. We employ the hidden states of the bi-LSTM [17] as sentence features with a dimension of 2000. The word embeddings are initialized with one-hot vectors without any pre-trained weights applied. The cross-frame entity association threshold $\gamma$ is set to $0.8$ by default. Training is done with the Adam optimizer [24] with an initial learning rate of $0.0001$ , and a scheduler that waits for $2$ epochs after loss stagnation before reducing the learning rate by a factor of $10$ . The batch size is $16$ .
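The learning-rate schedule described above behaves like a reduce-on-plateau rule. The class below is a framework-free sketch of that behaviour, not the authors' training code: the class name is hypothetical, and it assumes that "waiting for 2 epochs" means the rate is cut on the third consecutive epoch without improvement.

```python
class PlateauScheduler:
    """Cut the learning rate by `factor` once the loss has stagnated
    for more than `patience` consecutive epochs."""

    def __init__(self, lr=1e-4, patience=2, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Update with the latest epoch loss and return the current rate."""
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauScheduler(lr=1e-4, patience=2, factor=0.1)
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]
lrs = [scheduler.step(loss) for loss in losses]
# The rate stays at 1e-4 for four epochs and drops to about 1e-5 on the fifth.
```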

4.3 Comparison with State-of-the-Art Methods

We compare our ClawCraneNet with other state-of-the-art text-based video segmentation models following the settings in [10] on the two datasets, i.e., A2D Sentences and J-HMDB Sentences. The comparison results are shown in Table 1 and Table 2. First, on A2D Sentences, the methods of [14, 26] are pre-trained on the ReferIt dataset [23] and then fine-tuned on A2D Sentences. Other methods, including ours, are trained on A2D Sentences exclusively. As shown in Table 1, with the help of object-level relation comprehension, our approach achieves state-of-the-art performance on most metrics by a remarkable margin, especially at higher IoU thresholds. On P@0.8, our method outperforms the SOTA by a large margin of 16.7 $\%$ . Moreover, we bring 6.8 $\%$ improvement on Mean IoU and 10.4 $\%$ in mAP over the SOTA, which demonstrates the effectiveness of our method. Despite the clear gain on mean IoU, our overall IoU (63.1) is lower than that of Ning et al. [34] (66.1). Because overall IoU favors large objects, it is less sensitive to performance on smaller objects, which is crucial for reflecting model quality. This suggests that ClawCraneNet not only captures obvious larger objects but also learns the object-level semantic context for distinguishing small objects. Besides, our method is also efficient, running at 9.27 FPS, the highest among the methods that report speed.

On J-HMDB Sentences, for fair comparison, we follow the setting in [39, 38, 33, 10] and evaluate our model pre-trained on A2D Sentences without any additional fine-tuning, the same as for the other compared methods. Our approach significantly outperforms previous state-of-the-art methods on all metrics considered. The results on P@0.9 remain low for all methods; one possible reason is that the ground-truth masks in J-HMDB Sentences are generated from puppets, so it is hard for a model trained on precise segmentation masks to fit this data distribution.

Methods	Backbone	mAP 0.5:0.95	IoU Overall	IoU Mean
CondInst	R-101-FPN	49.4	63.1	59.9
CondInst	R-50-FPN	48.3	62.7	59.4
Mask R-CNN	R-50-FPN	48.1	62.4	58.8

Table 4: Impact of Instance Segmentation Modules. With weaker modules, we still get better performance than SOTAs.

4.4 Ablation Studies

Effectiveness of Top-Down Pipeline. We first investigate the effectiveness of our Top-Down Pipeline. We evaluate the basic top-down pipeline (segment-embed-retrieve), which ignores all relation information among candidate objects and removes all relation-based modules. The results are reported in Table 3. Compared to the state-of-the-art bottom-up methods shown in Table 1, our top-down pipeline achieves significantly better performance, especially on high-precision predictions.

Impact of Instance Segmentation Modules. In our experiments, for a better trade-off between time cost and performance, we employ a one-stage segmentation method [37]. The performance of our ClawCraneNet with different instance segmentation methods is shown in Table 4. We use CondInst (R-101-FPN) as the off-the-shelf instance segmentation module by default. With weaker instance segmentation models, our method still shows competitive performance compared with bottom-up approaches.

Impact of Positional Relation Module. As shown in Table 3, we tried adding different positional encoding methods and achieved significant improvements over the basic top-down pipeline. We can conclude that position information is very useful for the top-down framework in the text-based video segmentation task. In addition, our full Positional Relation Module (PRM) achieves better performance than absolute or relative position encoding alone. A likely reason is that absolute position encoding lacks awareness of relative descriptions like “the second from left”, while exclusively relative encoding struggles to balance absolute and relative information.

Impact of Text-guided Semantic Relation Module. We further evaluate the effectiveness of the proposed TSRM. As shown in Table 3 (rows 5 and 6), vanilla object-level self-attention barely benefits the performance, whereas a clear improvement occurs when text guidance is introduced to formulate object-level relations. A possible reason is that the plain visual-based relation module cannot correctly gather relational information by just measuring context similarities. With more linguistic information included, concrete relations can be formulated, leading to a positive impact on the performance.

Impact of Temporal Relation Module. By fully utilizing temporal coherence, our ClawCraneNet with the temporal relation module (the last row in Table 3) is further enhanced and outperforms all the other variants, which validates the effect of the design. Together, these results again confirm the merits of the object-level relation formulation.

4.5 Qualitative Analysis

We investigate the internal mechanism of ClawCraneNet by analyzing qualitative results. Compared with the bottom-up method ((b) of Figure 4 and Figure 5), our top-down pipeline produces reasonable segmentation results, while the bottom-up method confuses the relational information, which leads to ambiguous foreground masks. As shown in Figure 4 (c) and (d), when the text-guided semantic relation module is introduced, our network learns to capture the mutual information of corresponding objects. Visualization examples of text-guided attention weights are shown in Figure 4 (f). Comparisons between the complete ClawCraneNet and alternative structures are illustrated in Figure 5. Only some of the objects in A2D Sentences are labeled with language descriptions, and we show the unmentioned objects in Figure 5 with purple masks. Without the relative position module, the model tends to focus on objects that match the absolute position description “left”, resulting in a wrong prediction. Our temporal relation module helps to achieve temporal consistency among frames and corrects the misunderstanding of the previous module (Figure 5 (e)). In conclusion, these visualized results show the effectiveness of our top-down design and the object-relational modules in ClawCraneNet.

5 Conclusion

In this paper, we propose ClawCraneNet, a novel network that follows the segment-comprehend-retrieve strategy and is the first attempt to introduce object-level relations into text-based video segmentation. Different from previous bottom-up methods, ClawCraneNet maximizes the semantic information flow between object-level features by fully investigating the relationships among intra-frame and inter-frame objects, i.e., positional relation, text-guided semantic relation, and inter-frame temporal relation. Evaluations on commonly used benchmark datasets demonstrate that ClawCraneNet surpasses all the state-of-the-art methods by large margins.

Appendix A Appendix

A.1 Analysis of Puppet Mask in J-HMDB

In this section, we briefly explain the poor $P@0.9$ performance on the J-HMDB Sentences [20] dataset (Line 8 in Table 2). As shown in Figure 6, the ground-truth masks in J-HMDB Sentences [20] are puppet masks, which leads to inconsistency between the segmentation mask and the actual object. Since the evaluated model is trained on precise segmentation masks from A2D Sentences [45], it is hard for ClawCraneNet to fit the data distribution of J-HMDB without fine-tuning.
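The effect of coarse ground truth on overlap-thresholded precision can be seen in a small worked example. The sketch below assumes the common definition in which a sample counts as correct when its mask IoU exceeds the threshold; the toy masks and function names are hypothetical and are not taken from either dataset.

```python
def mask_iou(pred, gt):
    """Intersection over union of two binary masks given as nested 0/1 lists."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0


def precision_at(ious, threshold):
    """Fraction of samples whose IoU exceeds the given threshold."""
    return sum(iou > threshold for iou in ious) / len(ious)


# A precise prediction of the true object (8 pixels).
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
# A coarser, puppet-like ground truth that covers 2 extra pixels.
puppet_gt = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

iou = mask_iou(pred, puppet_gt)  # 8 / 10
p_at_05 = precision_at([iou], 0.5)
p_at_09 = precision_at([iou], 0.9)
```

Even though the prediction covers the true object exactly, its IoU against the coarser mask is 0.8. It therefore counts as correct at a threshold of 0.5 but as a failure at a strict threshold such as 0.9.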

A.2 Analysis of Object Embedding

We visualize the predicted embeddings of 12 language queries for all candidate objects on a randomly selected video from the A2D Sentences [45] validation set, using t-SNE [31] to project the visual object embeddings (256-dim) into a 2D space. As shown in Figure 7, embeddings belonging to different objects are separated by clearly distinguishable margins, confirming that ClawCraneNet learns discriminative object embeddings, which is a prerequisite of the temporal relation module.

A.3 More Details about Predicted Results

In Figure 8, we detail the predictions of ClawCraneNet. In each sub-figure, we give the original RGB frame and the language query as input, then plot the predicted results of ClawCraneNet and ACGA [39]. “All candidates” refers to the candidate objects perceived by the instance segmentation module.

Thanks to the object-level comprehension of ClawCraneNet, reasonable results with clear boundaries are generated. Specifically, compared to bottom-up methods, which mainly focus on salient objects, our network can perceive inconspicuous objects in their entirety and distinguish them using semantic context. Some failure cases are shown in Figure 9. It is still hard for ClawCraneNet to handle some visual ambiguities, e.g., objects moving at high speed (Figure 9 (a)) or objects in a mirror (Figure 9 (c)).

An interesting observation is that, with an ambiguous language description like “Man in red running”, multiple fitting objects can be highlighted with high similarity scores, as shown in Figure 9 (b). To give more examples, we provide some results in the form of videos in the supplementary material.
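The behaviour under ambiguous queries can be illustrated with a toy retrieval step. This sketch assumes that candidates are scored by cosine similarity between a query embedding and object embeddings; the embeddings, the `margin` parameter, and the function names are hypothetical and serve only to show how several candidates can score almost equally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm


def matching_objects(query, objects, margin=0.05):
    """Indices of candidates whose score lies within `margin` of the best one."""
    scores = [cosine(query, obj) for obj in objects]
    best = max(scores)
    hits = [i for i, s in enumerate(scores) if best - s <= margin]
    return hits, scores


# Two candidates fit the query almost equally well; the third does not.
query = [1.0, 1.0, 0.0]
objects = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.1], [0.0, 0.1, 1.0]]
hits, scores = matching_objects(query, objects)
```

Here the first two candidates receive nearly identical scores, so both are highlighted, while the unrelated third candidate scores far lower.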

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077–6086, 2018.
[2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, pages 3674–3683, 2018.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, pages 2425–2433, 2015.
[4] Mohit Bajaj, Lanjun Wang, and Leonid Sigal. G3raphground: Graph-based language grounding. In ICCV, pages 4281–4290, 2019.
[5] Guy Thomas Buswell. How people look at pictures: a study of the psychology and perception in art. 1935.
[6] Ding-Jie Chen, Songhao Jia, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu. See-through-text grouping for referring image segmentation. In ICCV, pages 7454–7463, 2019.
[7] Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin, and Ming-Hsuan Yang. Referring expression object segmentation with caption-aware consistency. In BMVC, page 263, 2019.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[9] Lianli Gao, Zhao Guo, Hanwang Zhang, Xing Xu, and Heng Tao Shen. Video captioning with attention-based lstm and semantic consistency. TMM, 19(9):2045–2055, 2017.
[10] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In CVPR, pages 5958–5966, 2018.
[11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[13] John M Henderson and Taylor R Hayes. Meaning-based guidance of attention in scenes as revealed by meaning maps. Nature Human Behaviour, 1(10):743–747, 2017.
[14] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV, pages 108–124, 2016.
[15] Zhiwei Hu, Guang Feng, Jiayu Sun, Lihe Zhang, and Huchuan Lu. Bi-directional relationship inferring network for referring image segmentation. In CVPR, pages 4424–4433, 2020.
[16] Shaofei Huang, Tianrui Hui, Si Liu, Guanbin Li, Yunchao Wei, Jizhong Han, Luoqi Liu, and Bo Li. Referring image segmentation via cross-modal progressive comprehension. In CVPR, pages 10488–10497, 2020.
[17] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[18] Tianrui Hui, Shaofei Huang, Si Liu, Zihan Ding, Guanbin Li, Wenguan Wang, Jizhong Han, and Fei Wang. Collaborative spatial-temporal modeling for language-queried video actor segmentation. In CVPR, pages 4187–4196, 2021.
[19] Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, and Jizhong Han. Linguistic structure guided context modeling for referring image segmentation. In ECCV, pages 59–75, 2020.
[20] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In ICCV, pages 3192–3199, 2013.
[21] David A Kalkstein, Leor M Hackel, and Yaacov Trope. Person-centered cognition: The presence of people in a visual scene promotes relational reasoning. Journal of Experimental Social Psychology, 90:104009, 2020.
[22] Yun-Ching Kao, Emily S Davis, and John DE Gabrieli. Neural correlates of actual and predicted memory formation. Nature neuroscience, 8(12):1776–1783, 2005.
[23] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787–798, 2014.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
[25] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
[26] Zhenyang Li, Ran Tao, Efstratios Gavves, Cees GM Snoek, and Arnold WM Smeulders. Tracking by natural language specification. In CVPR, pages 6495–6503, 2017.
[27] Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, and Yi Yang. Local-global context aware transformer for language-guided video segmentation. IEEE TPAMI, 45(8):10055–10069, 2023.
[28] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, pages 1271–1280, 2017.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[30] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, pages 289–297, 2016.
[31] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 2008.
[32] Edgar Margffoy-Tuay, Juan C Pérez, Emilio Botero, and Pablo Arbeláez. Dynamic multimodal instance segmentation guided by natural language queries. In ECCV, pages 630–645, 2018.
[33] Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. Visual-textual capsule routing for text-based video segmentation. In CVPR, pages 9942–9951, 2020.
[34] Ke Ning, Lingxi Xie, Fei Wu, and Qi Tian. Polar relative positional encoding for video-language segmentation. In IJCAI, 2020.
[35] Shuang Qiu, Yao Zhao, Jianbo Jiao, Yunchao Wei, and Shikui Wei. Referring image segmentation by generative adversarial learning. TMM, 22(5):1333–1344, 2019.
[36] Arka Sadhu, Kan Chen, and Ram Nevatia. Video object grounding using semantic roles in language description. In CVPR, pages 10417–10427, 2020.
[37] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[38] Hao Wang, Cheng Deng, Fan Ma, and Yi Yang. Context modulated dynamic networks for actor and action video segmentation with language queries. In AAAI, pages 12152–12159, 2020.
[39] Hao Wang, Cheng Deng, Junchi Yan, and Dacheng Tao. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In ICCV, pages 3939–3948, 2019.
[40] Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. Video captioning via hierarchical reinforcement learning. In CVPR, pages 4213–4222, 2018.
[41] Jeremy M Wolfe and Todd S Horowitz. Five factors that guide attention in visual search. Nature Human Behaviour, 1(3):1–8, 2017.
[42] Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, and Subhransu Maji. Phrasecut: Language-based image segmentation in the wild. In CVPR, pages 10216–10225, 2020.
[43] Yu Wu, Linchao Zhu, Lu Jiang, and Yi Yang. Decoupled novel object captioner. In ACM MM, 2018.
[44] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang. Dual attention matching for audio-visual event localization. In ICCV, 2019.
[45] Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J Corso. Can humans fly? action understanding with multiple classes of actors. In CVPR, pages 2264–2273, 2015.
[46] Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, and Liusheng Huang. Segment as points for efficient online multi-object tracking and segmentation. In ECCV, 2020.
[47] Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. A fast and accurate one-stage approach to visual grounding. In ICCV, pages 4683–4693, 2019.
[48] Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jingsong Su, and Jiebo Luo. Grounding-tracking-integration. TCSVT, 2020.
[49] Linwei Ye, Mrigank Rochan, Zhi Liu, and Yang Wang. Cross-modal self-attention network for referring image segmentation. In CVPR, pages 10502–10511, 2019.
[50] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In CVPR, pages 4651–4659, 2016.
[51] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, pages 1307–1315, 2018.
[52] Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In CVPR, pages 10668–10677, 2020.
[53] Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, and Xiaodan Liang. Vision-dialog navigation by exploring cross-modal memory. In CVPR, pages 10730–10739, 2020.
