BAISeg: Boundary Assisted Weakly Supervised Instance Segmentation

Tengbo Wang, Yu Bai HEBUSTS [email protected] , [email protected]

Abstract

How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), which is a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, therefore, it is different from previous DF-based approaches. In particular, we proposed the Cascade Fusion Module (CFM) and the Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employed Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch. This further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve considerable performance with only pixel-level annotations. The code will be available at https://github.com/wsis-seg/BAISeg.

1 Introduction

Refer to caption — Figure 1: The diagrams illustrate our proposed approach BAISeg, with both label generation and the prediction of instance segmentation masks. Compared to the existing methods, ours does not require instance-level annotations and can be derived from existing semantic segmentation masks, resulting in higher efficiency. Samples are from the PASCAL VOC 2012 dataset Everingham et al. (2010).

Instance segmentation is a task that jointly estimates pixel-level semantic classes and instance-level object masks, and it has made significant progress with the assistance of large datasets. However, it is highly time-consuming and costly to obtain pixel-level and instance-level annotations. To solve the expensive labeling problem, weak annotations such as box-level, point-level, and image-level annotations have been utilized in instance segmentation. These weakly-supervised instance segmentation (WSIS) methods can be roughly categorized into two approaches: the top-bottom approach and the bottom-up approach.

The top-down approach first localizes the region of instances and then extracts instance masks. For instance, Li et al. (2023c); Ge et al. (2022) use image-level annotations and derive instance regions from proposals, therefore simplifying WSIS as a classification problem. Lan et al. (2021); Tian et al. (2021b); Li et al. (2022b) directly utilize box-level annotations and guide the network to learn instance segmentation by a delicate architecture and loss design. Instance-level annotations are used either indirectly or directly in these methods. In contrast, bottom-up methods require learning relationships between pixels from pre-computed or known instance cues. For instance, Ahn et al. (2019); Kim et al. (2022) use pseudo segmentation labels that are derived from Class-Activation Maps and Weakly-Supervised Semantic Segmentations to pinpoint instance locations, whereas Liao et al. (2023); Kim et al. (2022); Cheng et al. (2022b) utilize point supervision of instances directly.

These methods generate displacement fields (DF) by learning the spatial relationship between pixels and further derive instance masks through post-processing algorithms such as clustering or pixel grou**. A common problem of the above works is that the centroids identified via clustering inherently lack stability (even ground truth centroids are provided) and are especially vulnerable to variations in clustering algorithms.

To address the above limitations, we proposed Boundary-Assisted Instance Segmentation (BAISeg), which is a new paradigm that carries out instance segmentation by using pixel-level annotations only. As shown in the “Instance Prediction” of Figure 1, BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch follows a top-down approach to extract instance masks by predicting class-agnostic instance boundaries. The semantic segmentation branch is responsible for deriving semantic masks. Finally, the semantic masks and the class-agnostic instance masks are combined to obtain the instance segmentation results.

In particular, we proposed the Cascade Fusion Module (CFM) and the Deep Mutual Attention(DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employed Pixel-to-Pixel Contrast to tackle the semantic drift problem Kim et al. (2022) and strengthened the continuity and closedness of the instance boundaries.

As depicted in “annotations generation” of Figure 1, there are two sources for the semantic segmentation masks. This makes our approach very flexible. Even without off-the-shelf proposal techniques, our approach achieves a competitive performance of 62.0% $mAP_{50}$ on VOC 2012 and 33.6% $mAP_{50}$ on the COCO Test-Dev. The main contributions of this paper are as follows:

•

We propose a novel WSIS paradigm – BAISeg. BAISeg utilizes a top-down approach to instance mask extraction, which does not require any instance-level annotations and avoids the limitations of relying on proposal algorithms or estimated centroids.
•

To extract instance masks, we propose the CFM module. CFM decouples the boundary detection branch and segmentation branch. In CFM, we further designed DMA to capture instance boundaries with weak responses.
•

To tackle the semantic drift problem, we introduced Pixel-to-Pixel Contrast to WSIS with a weighted contrast loss function and further improves the perception and closedness of instance boundaries.

2 Related Work

2.1 Instance Edge Detection

Instance segmentation has made significant progress with the help of large datasets. However, the segmentation result of instance boundaries is not satisfactory. Recent works try to utilize the edge information for improvement. SharpContour Zhu et al. (2022) proposes a contour-based boundary refinement method. STEAL Acuna et al. (2019) designed a unified quality metric for mask and boundary to better integrate edge detection and instance segmentation tasks. InstanceCut Kirillov et al. (2016) proposed a novel MultiCut formulation, deriving the optimal partitioning from images to instances through semantic segmentation and instance boundaries. Our work is most similar to that of InstanceCut, but different from all the aforementioned works.First, we achieve segmentation with only pixel-level annotations. Second, our method does not rely on the setting of MultiCut Chopra and Rao (1993) and uses a very simple and effective method for instance mask extraction.

2.2 Weakly-Supervised Instance Segmentation

The primary challenge in Weakly Supervised Instance Segmentation (WSIS) is to obtain instance-level information from weakly annotated labels. To solve this problem, PRM Zhang et al. (2019) introduced Peak Response Maps from existing proposals algorithm Pont-Tuset et al. (2017) to select appropriate segment proposals. By learning the pixel-to-pixel relationships, IRNet Ahn et al. (2019) estimates the image’s displacement fields to extract instance masks. BESTIE Kim et al. (2022) determines the centroid of instances by combining the centroid network and displacement field and then extracts instance masks using clustering algorithms. Point2Mask Li et al. (2023b) uses the instance GT points for supervision and utilizes OT theory to extract pseudo-labels for panoptic segmentation from semantic segmentation and instance boundaries. CIM Li et al. (2023c) alleviates the adverse effects resulting from redundant segmentation, building upon existing proposals. A detailed comparison of the above methods reveals the following issues: (1) Methods based on point supervision indirectly or directly use instance-level annotations. (2) Methods based on ready-made proposals are heavily dependent on existing proposal algorithms. (3) Estimated centroids are particularly susceptible to changes in clustering algorithms.

In this paper, we propose BAISeg to extract class-agnostic instance masks from instance boundary maps. BAISeg uses only pixel-level supervision and does not rely on any proposal algorithms or centroid estimation.

2.3 Contrastive Learning

Contrastive learning is an unsupervised learning method that aims to learn data representations by maximizing the similarity between related samples and minimizing the similarity between unrelated samples. A popular contrastive learning loss function is known as InfoNCE He et al. (2020). Recent works Wang et al. (2021); Zhao et al. (2021) applied contrastive learning to dense pixel prediction tasks to increase intra-class compactness and inter-class separability. Inspired by these works, we introduce Pixel-to-Pixel contrastive learning in the training of BAISeg, effectively enhancing the discriminative ability of the IABD branch.

3 Proposed Method

3.1 Preliminary and Overview

Given a training image ${X}\in\mathbb{R}^{H\times W\times 3}$ with both pixel-level annotation $\hat{{X}}_{{P}}\in\mathbb{R}^{H\times W\times 1}$ and instance-level annotation $\hat{{X}}_{{I}}\in\mathbb{R}^{H\times W\times 1}$ , the objective of instance segmentation model is to predict an instance-level object mask ${M}_{\text{ins }}\in\mathbb{R}^{H\times W\times 1}$ with pixel-level semantic classes. Our BAISeg also shares the same goal. As shown in Figure 2(a), BAISeg is a one-stage framework that mainly has four components:Semantic Segmentation Decoder,Instance-Aware Boundary Detection Module(IABM), Pixel-to-Pixel Contrast(PPC), and a Mask Extraction Pipeline (MEP). The model has two parallel branches. The IABD branch is responsible for distinguishing different instances belonging to the same class by predicting an Instance-Aware Boundary map. Another branch of semantic segmentation aims to obtain the semantic map of the image. Finally, instance segmentation is achieved by combining semantic maps and instance masks.

3.2 Instance Boundary Detection Branch

We propose a novel instance mask extraction method that accomplishes instance segmentation prediction using only pixel-level annotations. In particular, our method identifies instances by instance-aware boundary prediction rather than centroid estimation, therefore does not suffer from instable centroid predictions.

3.2.1 Branch Structure

To obtain accurate instance cues and differentiate different instances of the same class, we propose the IABM as shown in Figure 2 (b) (introduced in Sec. 3.3), PPC (introduced in Sec. 3.4), and MEP (introduced in Sec. 3.6).

An input $X$ is fed to a backbone network (e.g., HRNet-W48 Sun et al. (2019)) to extract the feature map $\mathcal{F}_{b}\in\mathbb{R}^{H\times W\times C}$ . Then $\mathcal{F}_{b}$ is forwarded to the IABM to obtain the instance-aware boundary map $B$ and the boundary feature map $\mathcal{F}_{p}$ . $B$ is the output of the IABM and is processed by the MEP to extract class-agnostic instance masks. $\mathcal{F}_{p}$ is extracted from the $R$ cascaded CFM blocks and is used in pixel-to-pixel contrast to improve the discriminative power of the IABD branch. This further strengthens the continuity and closedness of the instance boundaries.

3.2.2 Loss Fuction

We employ the loss function proposed in Xie and Tu (2015) for training boundary maps. Given a boundary map $B$ and its corresponding ground truth $Y$ , the loss $\mathcal{L}_{\text{B}}$ is calculated as

	$\displaystyle\mathcal{L}_{\text{B}}(B,Y)$	$\displaystyle=-\sum_{i,j}\left(Y_{i,j}\alpha\log\left(B_{i,j}\right)\right.$		(1)
		$\displaystyle\quad+\left.(1-Y_{i,j})(1-\alpha)\log\left(1-B_{i,j}\right)\right),$		(1)

where $B_{i,j}$ and $Y_{i,j}$ are the $(i,j)^{th}$ elements of matrices $B$ and $Y$ , respectively. Moreover, $\alpha=\frac{\left|Y^{-}\right|}{\left|Y^{-}\right|+\left|Y^{+}\right|}$ , where $\left|\cdot\right|$ denotes the number of pixels.

3.3 Instance-Aware Boundary Detection Module

The IABD branch aims to capture instance-aware boundary maps using backbone features, where the clarity and closedness of the boundaries largely limit the performance of instance mask extraction. We designed Instance-Aware Boundary Detection (IABM) to tackle this issue. As shown in Figure 2(b), The IABM consists of R Cascade Fusion Modules (CFM) and a standard $1\times 1$ convolutional layer.

The design of the IABM is based on two observations. Firstly, the class-agnostic boundary detection branch and the semantic segmentation branch are tightly coupled, and their mutual interaction limits the performance of WSIS. Comprising a cross-layer connection and feature map** operations, the CFM enables the IABD branch to learn independent and multi-scaled edge feature representations. Secondly, using semantic contour labels alone is insufficient for estimating the clear boundaries of each instance, for which we designed the DMA to focus on specific task features and capture instance boundary information with weak responses.

3.3.1 Cascade Fusion Module

We designed CFM based on two observations: By a cascaded design, the network can learn task-specific local representations in each stage. Secondly, a cascaded network extracts and refines features gradually, which helps capture multi-scaled edge information. As shown in Figure 2(b), CFM comprises three components: DMA, $1\times 1$ convolution, and FFM. Notably, DMA can be substituted by any other convolutional layer.

3.3.2 Deep Mutual Attention

As a submodule of CFM, Deep Mutual Attention (DMA) aims to capture weakly responsive instance boundary information from deeply fused features. As shown in Figure 2(c), DMA consists of two parts: the embedding operation and the Mutual Attention Unit (MAU). The embedding operation comprises $3\times 3$ convolutional layers followed by Batch Normalization (BN) and ReLU operations. The advantages of DMA encompass two aspects. Firstly, DMA employs the MAU to effectively capture attention features at various stages. Secondly, it integrates these attention features at different stages to capture edge features with weak responses.

The MAU is inspired by the work Tian et al. (2021a). Notably, we introduced a $1\times 1$ convolutional layer and learnable attention weights $\beta$ to the MAU, aiming to dynamically adjust the level of attention for the fused features in each CFM while preserving the original feature information. Figure 3 shows the detailed structure of the MAU. The top and bottom branches are channel-wise and spatial-wise attention blocks, respectively. Given input features $f_{n}{}^{\prime\prime}$ , we compute the channel-wise attention features $\mathcal{F}_{c}$ as follows:

\mathcal{F}_{c}\!=\!\sigma\!\left(\operatorname{MLP}\!\left(\operatorname{% AvgPool}_{c}\left(f_{n}{}^{\prime\prime}\right)\right)\!\!+\!\!\operatorname{% MLP}\!\left(\operatorname{MaxPool}_{c}\!\left(f_{n}{}^{\prime\prime}\right)% \right)\right)\!\!,\!\!

(2)

where $\mathrm{MaxPool}_{c}$ and $\mathrm{AvgPool}$ are the two channel-wise pooling operations, and MLP is the multi-layer perception with one hidden layer to generate the attention features. We also compute the spatial-wise attention features $\mathcal{F}_{s}$ as:

\mathcal{F}_{s}=\sigma\left(\operatorname{Conv}_{7\times 7}\left(\left[% \operatorname{AvgPool}_{s}\left(f_{n}{}^{\prime\prime}\right);\operatorname{% MaxPool}_{s}\left(f_{n}{}^{\prime\prime}\right)\right]\right)\right),

(3)

where $\operatorname{Conv}_{7\times 7}$ is a convolutional layer with kernel size 7 . The Features of identity transformation $\mathcal{F}_{h}$ is computed as:

\mathcal{F}_{h}=\operatorname{Conv}_{1\times 1}(f_{n}{}^{\prime\prime})

(4)

where $\operatorname{Conv}_{1\times 1}$ is a convolutional layer with kernel size 1 . The final attention features $\mathcal{F}_{att}$ are computed as follows:

\mathcal{F}_{att}=\mathcal{F}_{h}+\left(f_{n}{}^{\prime\prime}\times\mathcal{F% }_{c}+f_{n}{}^{\prime\prime}\times\mathcal{F}_{s}\right)\times\beta,

(5)

where $\times$ denotes the dot product operation, and $+$ is the element-wise summation operation.

3.3.3 Feature Fusion Module

Global features and local features are obtained in the upper and lower branches of CFM respectively. By fusing these features, the Feature Fusion Module (FFM) generates boundary-aware global features and refined local features. The detailed structure of FFM is shown in Figure 2(d).

Methods	Backbone	Superv	$mAP_{25}$	$mAP_{50}$	$mAP_{70}$	$mAP_{75}$
Mask R-CNN He et al. (2017)	ResNet-101	$\mathcal{F}$	76.7	67.9	52.5	44.9
EM-Paste Ge et al. (2022)	ResNet-101	$\mathcal{I}$	–	58.4	37.2	–
BESTIE Kim et al. (2022)	HRNet-W48	$\mathcal{I}$	53.5	41.8	28.3	24.2
CIM Li et al. (2023c)	HRNet-W48	$\mathcal{I}$	68.3	52.6	33.7	28.4
BESTIE Kim et al. (2022)	HRNet-W48	$\mathcal{P}$	58.6	46.7	33.1	26.3
AttnShif Liao et al. (2023)	ViT-S	$\mathcal{P}$	68.3	54.4	–	25.4
DiscoBox Lan et al. (2021)	ResNet-101	$\mathcal{B}$	72.8	62.2	45.5	37.5
SIM Li et al. (2023a)	ResNet-50	$\mathcal{B}$	–	65.5	35.6	–
Box2Mask Li et al. (2022b)	ResNet-50	$\mathcal{B}$	38.0	65.9	46.1	38.2
\cdashline1-7[0.8pt/2pt] BAISeg (OCRNet)	HRNet-W48	$\mathcal{P_{M}}$	52.73	44.0	32.1	28.9
BAISeg (Semantic_gd)	HRNet-W48	$\mathcal{P_{M}}$	55.3	48.4	36.9	32.3
BESTIE^† Kim et al. (2022)	HRNet-W48	$\mathcal{P}$	66.4	56.1	36.5	30.20
CIM^† Li et al. (2023c)	ResNet-50	$\mathcal{I}$	68.7	55.9	37.1	30.9
AttnShif^† Liao et al. (2023)	ViT-S	$\mathcal{P}$	70.3	57.1	–	30.4
\cdashline1-7[0.8pt/2pt] BAISeg (Semantic_gd)^⋆	HRNet-W48	$\mathcal{P_{M}}$	59.2	53.0	42.0	37.6
BAISeg (Semantic_gd)^⋆†	HRNet-W48	$\mathcal{P_{M}}$	69.4	62.0	44.0	36.0

Table 1: Comparison of state-of-the-art methods on the PASCAL VOC 2012validation set. “–” represents that the result is not reported in its paper.^† denotes additional training with Mask R-CNN to refine the prediction.^⋆ denotes Inference using the ground truth mask as semantic segmentation. “Superv.” represents the training supervision (

\mathcal{F}

: instance-level mask,

\mathcal{I}

: image-level class label,

\mathcal{P}

: point,

\mathcal{B}

: bounding box,

\mathcal{P_{M}}

: pixel-level mask). BAISeg (OCRNet) and BAISeg (Semantic_gd) indicate that our BAISeg uses the results of OCRNet Yuan et al. (2021) and ground truth mask as semantic segmentation for training respectively.

3.4 Pixel-to-Pixel Contrast

The limited discriminative power of IABD can lead to issues of instance boundaries not being closed, causing closely connected instances of the same class to be merged into a single instance. To enhance the model’s ability to discriminate and perceive instance edges, we introduced the pixel-to-pixel contrast training Wang et al. (2021); Zhao et al. (2021) in BAISeg. PPC’s core idea is to bring similar pixels closer together while repelling dissimilar ones. However, since we use semantic contour labels for sampling, there are a large number of incorrect labels in the background. To mitigate the negative impact of these incorrect labels on the boundary detection branch, we propose a weighted contrastive loss.

3.4.1 Weighted Pixel-to-Pixel Contrast Loss

The data samples in our contrastive loss computation are training image pixels. Given a semantic contour map $Y$ , for a pixel $i$ with its boundary pseudo-label $\bar{c}$ , the positive samples are those pixels that also belong to the same class $\bar{c}$ , while the negatives are the pixels belonging to background class ${c}$ . Our supervised, pixel-wise contrastive loss is defined as:

\small\!\!\!\!\mathcal{L}_{\text{P}}\!=\!\frac{1}{|\mathcal{P}_{i}|}\!\!\sum_{% \bm{i}^{+}\in\mathcal{P}_{i}\!\!}\!\!\!-_{\!}\alpha\log\frac{\exp(\bm{i}\!% \cdot\!\bm{i}^{+\!\!}/\tau)}{\exp(\bm{i}\!\cdot\!\bm{i}^{+\!\!}/\tau)\!+\!(1-% \alpha)\!\!\sum\nolimits_{\bm{i}^{-\!}\in\mathcal{N}_{i}\!}\!\exp(\bm{i}\!% \cdot\!\bm{i}^{-\!\!}/\tau)},\!\!\!

(6)

where $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ denote pixel embedding collections of the positive and negative samples, respectively, for pixel $i$ . Moreover, $\alpha=\frac{\left|Y^{-}\right|}{\left|Y^{-}\right|+\left|Y^{+}\right|}$ , where $\left|\cdot\right|$ denotes the number of pixels. Note that the positive/negative samples and the anchor $i$ are not restricted to originate from the same image.

3.5 Overall Loss Function

The loss function for our semantic segmentation branch is derived from DeepLabv3 Chen et al. (2018), and we denote it as $\mathcal{L}_{\text{S}}$ . The overall loss function of BAISeg can be formulated as follows:

\mathcal{L}_{\text{Overall}}=\alpha\times\mathcal{L}_{\text{P}}+\beta\times% \mathcal{L}_{\text{B}}+\gamma\times\mathcal{L}_{\text{S}},

(7)

where $\alpha$ , $\beta$ , and $\gamma$ represent the weights for the three loss components, and we set $\alpha=0.3$ , $\beta=50$ , and $\gamma=1$ .

Method Backbone Superv $AP$ $AP_{50}$ $AP_{75}$ COCO val2017 Mask R-CNN He et al. (2017) ResNet-101 $\mathcal{F}$ 35.4 57.3 37.5 CIM Li et al. (2023c) ResNet-50 $\mathcal{I}$ 11.9 22.8 11.1 CIM^† Li et al. (2023c) ResNet-50 $\mathcal{I}$ 17.0 29.4 17.0 BESTIE^† Kim et al. (2022) HRNet-W48 $\mathcal{P}$ 17.7 34.0 16.4 AttnShif Liao et al. (2023) ViT-S $\mathcal{P}$ 19.1 38.8 17.4 Box2Mask Li et al. (2022b) ResNet-50 $\mathcal{B}$ 32.2 54.4 32.8 \cdashline1-6[0.8pt/2pt] BAISeg (Semantic_gd)^⋆ HRNet-W48 $\mathcal{P_{M}}$ 15.8 25.8 15.9 BAISeg (Semantic_gd)^† HRNet-W48 $\mathcal{P_{M}}$ 19.1 33.4 19.0 COCO Test-Dev Mask R-CNN He et al. (2017) ResNet-101 $\mathcal{F}$ 35.7 58.0 37.8 CIM^† Li et al. (2023c) ResNet-50 $\mathcal{I}$ 17.0 29.7 17.0 LIID^† Liu et al. (2020) ResNet-101 $\mathcal{I}$ 16.0 27.1 16.5 BESTIE^† Kim et al. (2022) HRNet-W48 $\mathcal{P}$ 17.8 34.1 16.7 AttnShif Liao et al. (2023) ViT-S $\mathcal{P}$ 19.1 38.9 17.1 \cdashline1-6[0.8pt/2pt] BAISeg (Semantic_gd)^⋆⋆ HRNet-W48 $\mathcal{P_{M}}$ 13.5 22.8 13.8 BAISeg (Semantic_gd)^† HRNet-W48 $\mathcal{P_{M}}$ 19.0 33.6 19.0

Table 2: Comparison with the state-of-the-art methods on COCO dataset.^† denotes additional training with Mask R-CNN to refine the prediction.^⋆ denotes Inference using the ground truth mask as semantic segmentation.^⋆⋆ denotes Inference using the result of mask2former Cheng et al. (2022a) as semantic segmentation . “Superv.” represents the training supervision (

\mathcal{F}

: instance-level mask,

\mathcal{I}

: image-level class label,

\mathcal{P}

: point,

\mathcal{B}

: bounding box,

\mathcal{P_{M}}

: pixel-level mask). BAISeg (Semantic_gd) means that BAISeg employs the result of the ground truth mask as semantic segmentation.

3.6 Mask Extraction Pipeline

The IABD branch aims to extract class-agnostic instance masks from the instance boundary map. In simple terms, we utilize the well-known watershed algorithm to extract instance masks from the boundary map. However, the instance boundaries generated by the IABD branch have the following issues: (1) roughness of the boundaries, (2) closedness of the boundaries, and (3) holes in the masks. Therefore, we propose a specific Mask Extraction Pipeline (MEP) to deal with these problems. The whole pipeline can be divided into three stages. In stage one, Non-Maximum Suppression (NMS) is applied to process the boundary map to derive thinner boundaries. In stage two, a label map is extracted from the instance boundaries by the Connected-Component Labeling (CCL) algorithm HE20091977_CCL. In the third stage, the input image and label map are combined and then processed by the watershed algorithm Vincent and Soille (1991) to derive the class-agnostic instance masks.

To fill the holes in the derived instance masks, we further employed a closing operation. To improve the closedness of the boundaries, the semantic map is utilized as a filter to select only the pixels within the semantic region. We refer to the above additional operations as “refinement”.

4 Experiments

4.1 Datasets and Evaluation Metrics

Following previous methods, we demonstrate the effectiveness of the proposed approach on Pascal VOC 2012 Everingham et al. (2010) and COCO Lin et al. (2014) datasets. The VOC 2012 dataset includes 10,582 images for training and 1,449 images for validation, comprising 20 object categories. The COCO dataset consists of 115K training, 5K validation, and 20K testing images with 80 object categories. We evaluate the performance using the mean Average Precision ( $mAP$ ) with intersection-over-union (IOU) thresholds of 0.25, 0.5, 0.7, and 0.75 for VOC 2012 and Average Precision ( $AP$ ) over IoU thresholds from 0.5 to 0.95 for COCO.

4.2 Implementation Details

We used the PyTorch 1.13 framework with CUDA 11.7, CuDNN 8, and NVIDIA A40 GPUs. We adopt HRNet48 Tian et al. (2021a) as our backbone network. The input image size for training is $416\times 416$ , and we keep the original resolution for evaluation. We train the network with a batch size of 16, the Adam optimizer Kingma and Ba (2017) with a learning rate of $5\times 10^{-5}$ , and polynomial learning rate scheduling Liu et al. (2015). The total number of training iterations is $6\times 10^{4}$ for the VOC 2012 dataset and $48\times 10^{4}$ iterations for the COCO dataset. The training configuration of our Pixel-to-Pixel Contrast follows Wang et al. (2021). Following previous works Liu et al. (2022); Li et al. (2023c), we also generate pseudo labels from BAISeg for training Mask R-CNN.

4.3 Comparison With State-of-the-Art

4.3.1 Results on PASCAL VOC 2012

To illustrate the superior capabilities of BAISeg, we present a comparative analysis with other leading instance segmentation methods on the PASCAL VOC 2012 validation set, as shown in Table 1. Our method is evaluated without any special techniques or tricks. BAISeg experiments on segmentation masks of varying quality, and among these, BAISeg (Semantic_gd) demonstrated the best performance, achieving an accuracy of 55.32% $mAP_{25}$ , 48.40% $mAP_{50}$ , and 36.97% $mAP_{70}$ , and outperforms methods that rely solely on image-level supervision signals. Compared to approaches based on proposals, point-level, and box-level methods, BAISeg (Semantic_gd) also achieves competitive results. BAISeg (Semantic_gd)^⋆†exceeds methods based on image-level, point-level, and proposals. The performance of BAISeg is also competitive compared to fully supervised and box-level methods.

4.3.2 Results on COCO 2017

In Table 2, we conduct a comparison of BAISeg with other leading instance segmentation methods on the COCO validation and Test-Dev datasets. To ensure a fair comparison, none of the methods use additional training data. BAISeg (Semantic_gd)^⋆† has significant improvements over the best image-level methods in terms of $AP_{50}$ , improving by 4% and 3.9% on the validation set and Test-Dev set respectively. Compared to point-level methods, BAISeg performs comparably as well. However, there is still a slight gap when compared to methods trained with stronger localization supervision like Box2Mask and Mask R-CNN. Indeed, our primary focus is on establishing a novel paradigm for WSIS, where quantitative improvement is regarded as a secondary goal.

4.4 Ablation Study

We conduct several ablation studies on the Pascal VOC 2012 dataset to evaluate the effectiveness of each component. In these studies, we use HRNet-W48 as the backbone and omit MRCNN refinement to save time.

4.4.1 Impact of CFM

CFM DMA PPC Refinement $mAP_{50}$ $mAP_{75}$ $\times$ $\times$ $\times$ $\times$ $43.4$ $28.1$ $\checkmark$ $\times$ $\times$ $\times$ $45.3$ $31.3$ $\checkmark$ $\checkmark$ $\times$ $\times$ $47.0$ $32.1$ $\checkmark$ $\checkmark$ $\checkmark$ $\times$ $48.1$ $31.7$ $\checkmark$ $\checkmark$ $\checkmark$ $\checkmark$ $48.4$ $32.3$

Table 3: Effect of the proposed methods: CFM, DMA, PPC, and Refinement.

R GFLOPs(G) Memory(M) $mAP_{50}$ 1 73.3 0.39 44.5 2 158.6 0.88 46.4 3 158.6 1.38 45.7 4 329.4 1.87 47.2

Table 4: For the impact of different cascade numbers

R

in CFM, we set the hidden layer dimension of CFM to 128.

dim GFLOPs(G) Memory(M) $mAP_{50}$ 32 23.9 0.13 44.8 64 86.7 0.48 46.9 128 329.4 1.87 47.2 256 1282.5 7.37 46.7

Table 5: For the impact of different hidden layer dimensions of CFM, we set the number of CFMs

R=4

backbone dim GFLOPs(G) Memory(M) $mAP_{50}$ ResNet50^∗ 32 55.9 26.1 38.2 ResNet50^∗ 64 62.3 26.8 48.4 ResNet50^∗ 128 75.1 28.2 48.4 ResNet101^∗ 32 68.8 45.1 39.1 HRNet-W34 34 63.7 30.5 44.3 HRNet-W48 48 92.8 65.6 48.4 HRNetV2-W48 48 119.5 69.8 48.0

Table 6: Analysis of the effect of Backbone result on our WSIS performance.^⋆ denotes the use of multi-scale backbone features.

Temperature 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 $mAP_{50}$ 47.4 48.4 47.4 47.0 47.6 47.4 47.7 47.6

Table 7: Impact of Different Temperatures on Losses in the Pixel Contrast Module

The design of CFM helps decoupling the class-agnostic boundary detection branch from the semantic segmentation branch, while progressively extracting and refining boundary features across multiple stages. As shown in Table 3, CFM leads to an improvement of 1.9% in $mAP_{50}$ and 3.2% in $mAP_{75}$ . We also investigate the impact of the number and dimension of cascaded structures in CFM on GFLOPs, Memory, and $mAP50$ . Table 4 shows that the highest $mAP_{50}$ is achieved when R equals 4, while Table 5 reveals that the best results are obtained when the dimension is set to 128. With these two hyper-parameters set, CFM strikes a balance between performance and computational efficiency.

4.4.2 Impact of DMA

The DMA is designed to capture weakly responsive instance boundary information from deeply fused features. As illustrated in the second row of Table 3, DMA leads to an improvement of 1.7% in $mAP_{50}$ and 0.8% in $mAP_{75}$ . As a submodule of CFM, DMA shares the optimal configuration with CFM.

4.4.3 Impact of PPC

The goal of Pixel-to-Pixel Contrast (PPC) is to enhance the perceptual capability of the IABD branch, with its core principle focusing on attracting similar pixels while repelling dissimilar ones. As indicated in the third row of Table 3, PPC leads to an improvement of 1.1% in $mAP_{50}$ and a decrease of 0.4% in $mAP_{75}$ . This decrease in $mAP_{75}$ is attributed to the use of semantic contour pseudo-labels to sample pixels in PPC, resulting in rough predicted boundaries. We evaluate the impact of different temperatures on the weighted Pixel-to-Pixel Contrast loss. As shown in Table 7, the optimal performance is achieved when the temperature is set to $0.2$ .

4.4.4 Impact of Refinement

The Mask Extraction Pipeline is designed to extract class-agnostic instance masks from instance boundaries. In particular, we observed that some instance masks generated by our pipeline exhibit issues such as holes and discontinuities in connectivity. Additionally, the incomplete closeness of instance boundaries leads to the diffusion of instance pixels into the background. The refinement procedure significantly ameliorates the above issues. As shown in the last row of Table 3, the refinement results in a 0.3% and 0.6% improvement in terms of $mAP_{50}$ and $mAP_{75}$ , respectively.

4.4.5 Impact of Backbone

The result in Table 6 illustrates the impact of the Backbone selection on BAISeg performance. We employed HRNet-W48 Tian et al. (2021a) for BAISeg and achieved the best performance of 48.4 $mAP_{50}$ . An analysis of the first five rows of Table 6 reveals that the performance of ResNet50 gradually improves with increasing dimension, reaching saturation at a dimension of 128. Given that the ResNet architectures require multi-scale fusion to match the performance of HRNet-W48, we consider HRNet-W48 more suitable for deployment.

5 Conclusion

This paper introduces BAISeg, a novel WSIS method that bridges the gap between semantic segmentation and instance segmentation via instance boundary detection. To decouple two parallel branches, we propose CFM. Within CFM, we introduced DMA to identify weak responsive instance boundaries. During training, PPC is employed to address issues like the non-closedness of instance boundaries. Our method demonstrates competitive results on both VOC 2012 and COCO datasets. Future work will focus on applying our approach to more challenging computer vision tasks, such as panoptic segmentation Li et al. (2022a).

References

Acuna et al. [2019] David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations, 2019.
Ahn et al. [2019] Jiwoon Ahn, Sunghyun Cho, and Suha Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
Cheng et al. [2022a] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022.
Cheng et al. [2022b] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
Chopra and Rao [1993] Sunil Chopra and M. R. Rao. The partition problem. Mathematical Programming, 59:87–115, 1993.
Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010.
Ge et al. [2022] Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Laurent Itti, and Vibhav Vineet. Em-paste: Em-guided cut-paste with dall-e augmentation for image-level weakly supervised instance segmentation, 2022.
He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.
He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning, 2020.
Kim et al. [2022] Beomyoung Kim, Youngjoon Yoo, Chae Eun Rhee, and Junmo Kim. Beyond semantic to instance segmentation: Weakly-supervised instance segmentation via semantic knowledge transfer and self-refinement. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
Kirillov et al. [2016] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. Instancecut: from edges to instances with multicut, 2016.
Lan et al. [2021] Shiyi Lan, Zhiding Yu, Christopher Choy, Subhashree Radhakrishnan, Guilin Liu, Yuke Zhu, Larry S Davis, and Anima Anandkumar. Discobox: Weakly supervised instance segmentation and semantic correspondence from box supervision. In IEEE International Conference on Computer Vision, 2021.
Li et al. [2022a] Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022.
Li et al. [2022b] Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Risheng Yu, Xiansheng Hua, and Lei Zhang. Box2mask: Box-supervised instance segmentation via level-set evolution. arXiv preprint arXiv:2212.01579, 2022.
Li et al. [2023a] Ruihuang Li, Chenhang He, Yabin Zhang, Shuai Li, Liyi Chen, and Lei Zhang. Sim: Semantic-aware instance mask generation for box-supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
Li et al. [2023b] Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, and Lei Zhang. Point2mask: Point-supervised panoptic segmentation via optimal transport, 2023.
Li et al. [2023c] Zecheng Li, Zening Zeng, Yuqi Liang, and **-Gang Yu. Complete instances mining for weakly supervised instance segmentation. In International Joint Conference on Artificial Intelligence, 2023.
Liao et al. [2023] Mingxiang Liao, Zonghao Guo, Yuze Wang, Peng Yuan, Bailan Feng, and Fang Wan. Attentionshift: Iteratively estimated part-based attention map for pointly supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
Liu et al. [2015] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better, 2015.
Liu et al. [2020] Yun Liu, Yu-Huan Wu, Pei-Song Wen, Yu-Jun Shi, Yu Qiu, and Ming-Ming Cheng. Leveraging instance-, image-and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
Liu et al. [2022] Yun Liu, Yu-Huan Wu, Peisong Wen, Yujun Shi, Yu Qiu, and Ming-Ming Cheng. Leveraging instance-, image- and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1415–1428, 2022.
Pont-Tuset et al. [2017] Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T.Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grou** for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, January 2017.
Sun et al. [2019] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and **gdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
Tian et al. [2021a] Xin Tian, Ke Xu, Xin Yang, Baocai Yin, and Rynson W. H. Lau. Learning to detect instance-level salient objects using complementary image labels, 2021.
Tian et al. [2021b] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
Vincent and Soille [1991] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
Wang et al. [2021] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation, 2021.
Xie and Tu [2015] Saining Xie and Zhuowen Tu. Holistically-nested edge detection, 2015.
Yuan et al. [2021] Yuhui Yuan, Xiaokang Chen, Xilin Chen, and **gdong Wang. Segmentation transformer: Object-contextual representations for semantic segmentation, 2021.
Zhang et al. [2019] Bingfeng Zhang, Jimin Xiao, Yunchao Wei, Mingjie Sun, and Kaizhu Huang. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach, 2019.
Zhao et al. [2021] Xiangyun Zhao, Raviteja Vemulapalli, Philip Mansfield, Boqing Gong, Bradley Green, Lior Shapira, and Ying Wu. Contrastive learning for label-efficient semantic segmentation, 2021.
Zhu et al. [2022] Chenming Zhu, Xuanye Zhang, Yanran Li, Liangdong Qiu, Kai Han, and Xiaoguang Han. Sharpcontour: A contour-based boundary refinement approach for efficient and accurate instance segmentation, 2022.