BAISeg: Boundary Assisted Weakly Supervised Instance Segmentation

Tengbo Wang, Yu Bai (HEBUSTS)
Abstract

How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) by learning inter-pixel relations and then perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations only. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, which distinguishes it from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During training, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch, which further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve competitive performance with only pixel-level annotations. The code will be available at https://github.com/wsis-seg/BAISeg.

1 Introduction

Figure 1: The diagrams illustrate our proposed approach BAISeg, with both label generation and the prediction of instance segmentation masks. Compared to the existing methods, ours does not require instance-level annotations and can be derived from existing semantic segmentation masks, resulting in higher efficiency. Samples are from the PASCAL VOC 2012 dataset Everingham et al. (2010).

Instance segmentation is a task that jointly estimates pixel-level semantic classes and instance-level object masks, and it has made significant progress with the assistance of large datasets. However, obtaining pixel-level and instance-level annotations is highly time-consuming and costly. To reduce the labeling cost, weak annotations such as box-level, point-level, and image-level annotations have been utilized in instance segmentation. These weakly-supervised instance segmentation (WSIS) methods can be roughly categorized into two approaches: the top-down approach and the bottom-up approach.

The top-down approach first localizes the region of instances and then extracts instance masks. For instance, Li et al. (2023c); Ge et al. (2022) use image-level annotations and derive instance regions from proposals, thereby simplifying WSIS to a classification problem. Lan et al. (2021); Tian et al. (2021b); Li et al. (2022b) directly utilize box-level annotations and guide the network to learn instance segmentation through a delicate architecture and loss design. Instance-level annotations are used either indirectly or directly in these methods. In contrast, bottom-up methods learn relationships between pixels from pre-computed or known instance cues. For instance, Ahn et al. (2019); Kim et al. (2022) use pseudo segmentation labels derived from Class-Activation Maps and Weakly-Supervised Semantic Segmentation to pinpoint instance locations, whereas Liao et al. (2023); Kim et al. (2022); Cheng et al. (2022b) utilize point supervision of instances directly.

These methods generate displacement fields (DF) by learning the spatial relationship between pixels and then derive instance masks through post-processing algorithms such as clustering or pixel grouping. A common problem of the above works is that the centroids identified via clustering inherently lack stability (even when ground-truth centroids are provided) and are especially vulnerable to variations in clustering algorithms.

To address the above limitations, we propose Boundary-Assisted Instance Segmentation (BAISeg), a new paradigm that carries out instance segmentation by using pixel-level annotations only. As shown in the “Instance Prediction” part of Figure 1, BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch follows a top-down approach to extract instance masks by predicting class-agnostic instance boundaries. The semantic segmentation branch is responsible for deriving semantic masks. Finally, the semantic masks and the class-agnostic instance masks are combined to obtain the instance segmentation results.

In particular, we propose the Cascade Fusion Module (CFM) and Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During training, we employ Pixel-to-Pixel Contrast to tackle the semantic drift problem Kim et al. (2022) and to strengthen the continuity and closedness of the instance boundaries.

As depicted in “annotations generation” of Figure 1, there are two sources for the semantic segmentation masks. This makes our approach very flexible. Even without off-the-shelf proposal techniques, our approach achieves a competitive performance of 62.0% $mAP_{50}$ on VOC 2012 and 33.6% $mAP_{50}$ on the COCO Test-Dev. The main contributions of this paper are as follows:

  • We propose a novel WSIS paradigm – BAISeg. BAISeg utilizes a top-down approach to instance mask extraction, which does not require any instance-level annotations and avoids the limitations of relying on proposal algorithms or estimated centroids.

  • To extract instance masks, we propose the CFM module. CFM decouples the boundary detection branch and segmentation branch. In CFM, we further designed DMA to capture instance boundaries with weak responses.

  • To tackle the semantic drift problem, we introduce Pixel-to-Pixel Contrast to WSIS with a weighted contrast loss function, which further improves the perception and closedness of instance boundaries.

2 Related Work

2.1 Instance Edge Detection

Instance segmentation has made significant progress with the help of large datasets. However, the segmentation quality at instance boundaries is not satisfactory, and recent works try to utilize edge information for improvement. SharpContour Zhu et al. (2022) proposes a contour-based boundary refinement method. STEAL Acuna et al. (2019) designed a unified quality metric for mask and boundary to better integrate edge detection and instance segmentation tasks. InstanceCut Kirillov et al. (2016) proposed a novel MultiCut formulation, deriving the optimal partitioning from images to instances through semantic segmentation and instance boundaries. Our work is most similar to InstanceCut, but differs from all the aforementioned works. First, we achieve segmentation with only pixel-level annotations. Second, our method does not rely on the setting of MultiCut Chopra and Rao (1993) and uses a very simple and effective method for instance mask extraction.

2.2 Weakly-Supervised Instance Segmentation

The primary challenge in Weakly Supervised Instance Segmentation (WSIS) is to obtain instance-level information from weakly annotated labels. To solve this problem, PRM Zhang et al. (2019) introduced Peak Response Maps and used an existing proposal algorithm Pont-Tuset et al. (2017) to select appropriate segment proposals. By learning pixel-to-pixel relationships, IRNet Ahn et al. (2019) estimates the image’s displacement fields to extract instance masks. BESTIE Kim et al. (2022) determines the centroid of instances by combining the centroid network and displacement field and then extracts instance masks using clustering algorithms. Point2Mask Li et al. (2023b) uses the instance GT points for supervision and utilizes OT theory to extract pseudo-labels for panoptic segmentation from semantic segmentation and instance boundaries. CIM Li et al. (2023c) alleviates the adverse effects resulting from redundant segmentation, building upon existing proposals. A detailed comparison of the above methods reveals the following issues: (1) Methods based on point supervision indirectly or directly use instance-level annotations. (2) Methods based on ready-made proposals are heavily dependent on existing proposal algorithms. (3) Estimated centroids are particularly susceptible to changes in clustering algorithms.

In this paper, we propose BAISeg to extract class-agnostic instance masks from instance boundary maps. BAISeg uses only pixel-level supervision and does not rely on any proposal algorithms or centroid estimation.

2.3 Contrastive Learning

Contrastive learning is an unsupervised learning method that aims to learn data representations by maximizing the similarity between related samples and minimizing the similarity between unrelated samples. A popular contrastive learning loss function is known as InfoNCE He et al. (2020). Recent works Wang et al. (2021); Zhao et al. (2021) applied contrastive learning to dense pixel prediction tasks to increase intra-class compactness and inter-class separability. Inspired by these works, we introduce Pixel-to-Pixel contrastive learning in the training of BAISeg, effectively enhancing the discriminative ability of the IABD branch.

3 Proposed Method

Figure 2: Our proposed BAISeg architecture (a) mainly consists of two parallel branches with a shared backbone: the IABD (b) branch and the semantic segmentation branch. The IABD branch determines the boundaries of instances by predicting instance-aware boundary maps and extracts class-agnostic masks via the Mask Extraction Pipeline. The semantic segmentation branch predicts the semantic maps of instances. Instance segmentations are derived by combining the semantic maps and class-agnostic instance masks. The entire network is optimized by minimizing the $\mathcal{L}_{\text{P}}$, $\mathcal{L}_{\text{B}}$, and $\mathcal{L}_{\text{S}}$ losses on pixel-level annotations. The $T$ function is used to extract semantic contour edge labels from the segmentation masks by spatial gradient deriving. This sample is taken from the validation set of PASCAL VOC 2012.

3.1 Preliminary and Overview

Given a training image $X \in \mathbb{R}^{H \times W \times 3}$ with both pixel-level annotation $\hat{X}_{P} \in \mathbb{R}^{H \times W \times 1}$ and instance-level annotation $\hat{X}_{I} \in \mathbb{R}^{H \times W \times 1}$, the objective of an instance segmentation model is to predict an instance-level object mask $M_{\text{ins}} \in \mathbb{R}^{H \times W \times 1}$ with pixel-level semantic classes. Our BAISeg also shares the same goal. As shown in Figure 2(a), BAISeg is a one-stage framework that mainly has four components: Semantic Segmentation Decoder, Instance-Aware Boundary Detection Module (IABM), Pixel-to-Pixel Contrast (PPC), and a Mask Extraction Pipeline (MEP). The model has two parallel branches. The IABD branch is responsible for distinguishing different instances belonging to the same class by predicting an Instance-Aware Boundary map. The semantic segmentation branch aims to obtain the semantic map of the image. Finally, instance segmentation is achieved by combining semantic maps and instance masks.
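The caption of Figure 2 mentions a function $T$ that derives semantic contour labels from segmentation masks by a spatial gradient. A minimal sketch of such a function is given below, assuming that a contour pixel is any pixel whose class label differs from a 4-connected neighbour; the function name `semantic_contours` and this neighbourhood rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def semantic_contours(mask):
    """Return a binary contour map from an integer semantic mask.

    Assumption: a pixel is a contour pixel if its label differs from the
    label of a horizontal or vertical neighbour (a discrete spatial gradient).
    """
    mask = np.asarray(mask)
    edge = np.zeros(mask.shape, dtype=bool)
    dx = mask[:, 1:] != mask[:, :-1]   # label changes between horizontal neighbours
    dy = mask[1:, :] != mask[:-1, :]   # label changes between vertical neighbours
    # Mark both pixels on each side of a label change.
    edge[:, 1:] |= dx
    edge[:, :-1] |= dx
    edge[1:, :] |= dy
    edge[:-1, :] |= dy
    return edge.astype(np.uint8)
```

Note that contours obtained this way separate classes, not instances: two touching objects of the same class share one label and produce no edge between them, which is the gap the IABD branch is meant to close.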

3.2 Instance Boundary Detection Branch

We propose a novel instance mask extraction method that accomplishes instance segmentation prediction using only pixel-level annotations. In particular, our method identifies instances by instance-aware boundary prediction rather than centroid estimation, and therefore does not suffer from unstable centroid predictions.

3.2.1 Branch Structure

To obtain accurate instance cues and differentiate different instances of the same class, we propose the IABM as shown in Figure 2 (b) (introduced in Sec. 3.3), PPC (introduced in Sec. 3.4), and MEP (introduced in Sec. 3.6).

An input $X$ is fed to a backbone network (e.g., HRNet-W48 Sun et al. (2019)) to extract the feature map $\mathcal{F}_{b} \in \mathbb{R}^{H \times W \times C}$. Then $\mathcal{F}_{b}$ is forwarded to the IABM to obtain the instance-aware boundary map $B$ and the boundary feature map $\mathcal{F}_{p}$. $B$ is the output of the IABM and is processed by the MEP to extract class-agnostic instance masks. $\mathcal{F}_{p}$ is extracted from the $R$ cascaded CFM blocks and is used in pixel-to-pixel contrast to improve the discriminative power of the IABD branch. This further strengthens the continuity and closedness of the instance boundaries.

3.2.2 Loss Function

We employ the loss function proposed in Xie and Tu (2015) for training boundary maps. Given a boundary map $B$ and its corresponding ground truth $Y$, the loss $\mathcal{L}_{\text{B}}$ is calculated as

$$\mathcal{L}_{\text{B}}(B, Y) = -\sum_{i,j}\Big(Y_{i,j}\,\alpha\log\left(B_{i,j}\right) + (1 - Y_{i,j})(1-\alpha)\log\left(1 - B_{i,j}\right)\Big), \tag{1}$$

where $B_{i,j}$ and $Y_{i,j}$ are the $(i,j)^{th}$ elements of matrices $B$ and $Y$, respectively. Moreover, $\alpha = \frac{|Y^{-}|}{|Y^{-}| + |Y^{+}|}$, where $|\cdot|$ denotes the number of pixels.
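As an illustration of Eq. (1), the following NumPy sketch evaluates the class-balanced boundary loss for a single predicted map. The function name `weighted_boundary_loss` and the clipping constant `eps` (added only to keep the logarithms finite) are assumptions made for this example.

```python
import numpy as np

def weighted_boundary_loss(B, Y, eps=1e-7):
    """Class-balanced cross-entropy of Eq. (1) for one boundary map.

    B: predicted boundary probabilities in [0, 1], shape (H, W).
    Y: binary ground-truth boundary map, same shape.
    eps: numerical-stability constant (an assumption, not from the paper).
    """
    B = np.clip(np.asarray(B, dtype=float), eps, 1.0 - eps)
    Y = np.asarray(Y, dtype=float)
    n_pos = Y.sum()                    # |Y^+|: number of boundary pixels
    n_neg = Y.size - n_pos             # |Y^-|: number of non-boundary pixels
    alpha = n_neg / (n_neg + n_pos)    # rare boundary pixels get the larger weight
    per_pixel = Y * alpha * np.log(B) + (1.0 - Y) * (1.0 - alpha) * np.log(1.0 - B)
    return -per_pixel.sum()
```

Because boundary pixels are far rarer than non-boundary pixels, $\alpha$ is close to 1, so each boundary pixel contributes much more to the loss than each background pixel.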

3.3 Instance-Aware Boundary Detection Module

The IABD branch aims to capture instance-aware boundary maps using backbone features, where the clarity and closedness of the boundaries largely limit the performance of instance mask extraction. We designed the Instance-Aware Boundary Detection Module (IABM) to tackle this issue. As shown in Figure 2(b), the IABM consists of $R$ Cascade Fusion Modules (CFM) and a standard $1 \times 1$ convolutional layer.

The design of the IABM is based on two observations. Firstly, the class-agnostic boundary detection branch and the semantic segmentation branch are tightly coupled, and their mutual interaction limits the performance of WSIS. Comprising a cross-layer connection and feature mapping operations, the CFM enables the IABD branch to learn independent and multi-scaled edge feature representations. Secondly, using semantic contour labels alone is insufficient for estimating the clear boundaries of each instance, for which we designed the DMA to focus on specific task features and capture instance boundary information with weak responses.

3.3.1 Cascade Fusion Module

We designed CFM based on two observations. Firstly, with a cascaded design, the network can learn task-specific local representations in each stage. Secondly, a cascaded network extracts and refines features gradually, which helps capture multi-scaled edge information. As shown in Figure 2(b), CFM comprises three components: DMA, $1 \times 1$ convolution, and FFM. Notably, DMA can be substituted by any other convolutional layer.

3.3.2 Deep Mutual Attention

Figure 3: Mutual Attention Unit.

As a submodule of CFM, Deep Mutual Attention (DMA) aims to capture weakly responsive instance boundary information from deeply fused features. As shown in Figure 2(c), DMA consists of two parts: the embedding operation and the Mutual Attention Unit (MAU). The embedding operation comprises $3 \times 3$ convolutional layers followed by Batch Normalization (BN) and ReLU operations. The advantages of DMA encompass two aspects. Firstly, DMA employs the MAU to effectively capture attention features at various stages. Secondly, it integrates these attention features at different stages to capture edge features with weak responses.

The MAU is inspired by the work of Tian et al. (2021a). Notably, we introduced a $1 \times 1$ convolutional layer and learnable attention weights $\beta$ to the MAU, aiming to dynamically adjust the level of attention for the fused features in each CFM while preserving the original feature information. Figure 3 shows the detailed structure of the MAU. The top and bottom branches are channel-wise and spatial-wise attention blocks, respectively. Given input features $f_{n}''$, we compute the channel-wise attention features $\mathcal{F}_{c}$ as follows:

$$\mathcal{F}_{c} = \sigma\left(\operatorname{MLP}\left(\operatorname{AvgPool}_{c}\left(f_{n}''\right)\right) + \operatorname{MLP}\left(\operatorname{MaxPool}_{c}\left(f_{n}''\right)\right)\right), \tag{2}$$

where $\operatorname{MaxPool}_{c}$ and $\operatorname{AvgPool}_{c}$ are the two channel-wise pooling operations, and MLP is a multi-layer perceptron with one hidden layer that generates the attention features. We also compute the spatial-wise attention features $\mathcal{F}_{s}$ as:

$$\mathcal{F}_{s} = \sigma\left(\operatorname{Conv}_{7 \times 7}\left(\left[\operatorname{AvgPool}_{s}\left(f_{n}''\right); \operatorname{MaxPool}_{s}\left(f_{n}''\right)\right]\right)\right), \tag{3}$$

where $\operatorname{Conv}_{7 \times 7}$ is a convolutional layer with kernel size 7. The identity-transformation features $\mathcal{F}_{h}$ are computed as:

$$\mathcal{F}_{h} = \operatorname{Conv}_{1 \times 1}\left(f_{n}''\right), \tag{4}$$

where $\operatorname{Conv}_{1 \times 1}$ is a convolutional layer with kernel size 1. The final attention features $\mathcal{F}_{att}$ are computed as follows:

$$\mathcal{F}_{att} = \mathcal{F}_{h} + \left(f_{n}'' \times \mathcal{F}_{c} + f_{n}'' \times \mathcal{F}_{s}\right) \times \beta, \tag{5}$$

where $\times$ denotes the dot product operation, and $+$ is the element-wise summation operation.
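To make the data flow of Eqs. (2)–(5) concrete, the sketch below runs one forward pass of an MAU-like block in NumPy. Several points are assumptions made for illustration: $\sigma$ is taken to be the sigmoid, the pooling follows the common channel/spatial attention convention (spatial pooling for $\mathcal{F}_c$, pooling over channels for $\mathcal{F}_s$), the product $\times$ is read as broadcast element-wise multiplication, and the weight names `w1`, `w2`, `w_sp`, `w_id` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Stride-1 cross-correlation with zero 'same' padding.
    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    # win has shape (C_in, H, W, k, k): one k-by-k window per output pixel.
    win = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", win, w)

def mutual_attention_unit(f, w1, w2, w_sp, w_id, beta):
    """f: (C, H, W) input features.
    w1: (C_hidden, C), w2: (C, C_hidden) -> shared one-hidden-layer MLP.
    w_sp: (1, 2, 7, 7) spatial-attention kernel, w_id: (C, C) 1x1 convolution.
    beta: scalar attention weight (learnable in the paper, fixed here)."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)          # ReLU hidden layer

    # Eq. (2): channel attention from spatially pooled descriptors.
    F_c = sigmoid(mlp(f.mean(axis=(1, 2))) + mlp(f.max(axis=(1, 2))))[:, None, None]
    # Eq. (3): spatial attention from channel-pooled maps and a 7x7 convolution.
    pooled = np.concatenate([f.mean(axis=0, keepdims=True),
                             f.max(axis=0, keepdims=True)], axis=0)
    F_s = sigmoid(conv2d_same(pooled, w_sp))         # shape (1, H, W)
    # Eq. (4): 1x1 convolution acting as the identity-transformation path.
    F_h = np.einsum("oc,chw->ohw", w_id, f)
    # Eq. (5): attention-modulated features added on top of F_h.
    return F_h + (f * F_c + f * F_s) * beta
```

With `beta = 0` the block reduces to the $1 \times 1$ convolution path alone, which shows how $\beta$ controls the amount of attention mixed into the original features.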

3.3.3 Feature Fusion Module

Global features and local features are obtained in the upper and lower branches of CFM respectively. By fusing these features, the Feature Fusion Module (FFM) generates boundary-aware global features and refined local features. The detailed structure of FFM is shown in Figure 2(d).

| Methods | Backbone | Superv. | $mAP_{25}$ | $mAP_{50}$ | $mAP_{70}$ | $mAP_{75}$ |
|---|---|---|---|---|---|---|
| Mask R-CNN He et al. (2017) | ResNet-101 | $\mathcal{F}$ | 76.7 | 67.9 | 52.5 | 44.9 |
| EM-Paste Ge et al. (2022) | ResNet-101 | $\mathcal{I}$ | – | 58.4 | – | 37.2 |
| BESTIE Kim et al. (2022) | HRNet-W48 | $\mathcal{I}$ | 53.5 | 41.8 | 28.3 | 24.2 |
| CIM Li et al. (2023c) | HRNet-W48 | $\mathcal{I}$ | 68.3 | 52.6 | 33.7 | 28.4 |
| BESTIE Kim et al. (2022) | HRNet-W48 | $\mathcal{P}$ | 58.6 | 46.7 | 33.1 | 26.3 |
| AttnShift Liao et al. (2023) | ViT-S | $\mathcal{P}$ | 68.3 | 54.4 | – | 25.4 |
| DiscoBox Lan et al. (2021) | ResNet-101 | $\mathcal{B}$ | 72.8 | 62.2 | 45.5 | 37.5 |
| SIM Li et al. (2023a) | ResNet-50 | $\mathcal{B}$ | – | 65.5 | – | 35.6 |
| Box2Mask Li et al. (2022b) | ResNet-50 | $\mathcal{B}$ | 38.0 | 65.9 | 46.1 | 38.2 |
| BAISeg (OCRNet) | HRNet-W48 | $\mathcal{P_{M}}$ | 52.73 | 44.0 | 32.1 | 28.9 |
| BAISeg (Semantic_gd) | HRNet-W48 | $\mathcal{P_{M}}$ | 55.3 | 48.4 | 36.9 | 32.3 |
| BESTIE Kim et al. (2022) | HRNet-W48 | $\mathcal{P}$ | 66.4 | 56.1 | 36.5 | 30.20 |
| CIM Li et al. (2023c) | ResNet-50 | $\mathcal{I}$ | 68.7 | 55.9 | 37.1 | 30.9 |
| AttnShift Liao et al. (2023) | ViT-S | $\mathcal{P}$ | 70.3 | 57.1 | – | 30.4 |
| BAISeg (Semantic_gd) | HRNet-W48 | $\mathcal{P_{M}}$ | 59.2 | 53.0 | 42.0 | 37.6 |
| BAISeg (Semantic_gd)⋆† | HRNet-W48 | $\mathcal{P_{M}}$ | 69.4 | 62.0 | 44.0 | 36.0 |
Table 1: Comparison of state-of-the-art methods on the PASCAL VOC 2012 validation set. “–” represents that the result is not reported in its paper. The row markers (⋆, †) indicate additional training with Mask R-CNN to refine the prediction and inference using the ground truth mask as semantic segmentation. “Superv.” represents the training supervision ($\mathcal{F}$: instance-level mask, $\mathcal{I}$: image-level class label, $\mathcal{P}$: point, $\mathcal{B}$: bounding box, $\mathcal{P_{M}}$: pixel-level mask). BAISeg (OCRNet) and BAISeg (Semantic_gd) indicate that our BAISeg uses the results of OCRNet Yuan et al. (2021) and the ground truth mask, respectively, as semantic segmentation for training.

3.4 Pixel-to-Pixel Contrast

The limited discriminative power of IABD can lead to issues of instance boundaries not being closed, causing closely connected instances of the same class to be merged into a single instance. To enhance the model’s ability to discriminate and perceive instance edges, we introduced the pixel-to-pixel contrast training Wang et al. (2021); Zhao et al. (2021) in BAISeg. PPC’s core idea is to bring similar pixels closer together while repelling dissimilar ones. However, since we use semantic contour labels for sampling, there are a large number of incorrect labels in the background. To mitigate the negative impact of these incorrect labels on the boundary detection branch, we propose a weighted contrastive loss.

3.4.1 Weighted Pixel-to-Pixel Contrast Loss

The data samples in our contrastive loss computation are training image pixels. Given a semantic contour map $Y$, for a pixel $i$ with its boundary pseudo-label $\bar{c}$, the positive samples are those pixels that also belong to the same class $\bar{c}$, while the negatives are the pixels belonging to the background class $c$. Our supervised, pixel-wise contrastive loss is defined as:

$$\mathcal{L}_{\text{P}} = \frac{1}{|\mathcal{P}_{i}|}\sum_{\boldsymbol{i}^{+} \in \mathcal{P}_{i}} -\alpha \log \frac{\exp(\boldsymbol{i} \cdot \boldsymbol{i}^{+} / \tau)}{\exp(\boldsymbol{i} \cdot \boldsymbol{i}^{+} / \tau) + (1-\alpha)\sum_{\boldsymbol{i}^{-} \in \mathcal{N}_{i}} \exp(\boldsymbol{i} \cdot \boldsymbol{i}^{-} / \tau)}, \tag{6}$$

where $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ denote pixel embedding collections of the positive and negative samples, respectively, for pixel $i$, and $\tau$ is a temperature hyper-parameter. Moreover, $\alpha = \frac{|Y^{-}|}{|Y^{-}| + |Y^{+}|}$, where $|\cdot|$ denotes the number of pixels. Note that the positive/negative samples and the anchor $i$ are not restricted to originate from the same image.
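The weighted contrastive loss of Eq. (6) can be sketched for a single anchor pixel as follows. This is a minimal illustration under stated assumptions: embeddings are plain NumPy vectors (typically L2-normalised in practice), and the function name `weighted_pixel_contrast` and the default `tau=0.1` are hypothetical choices not given in the text.

```python
import numpy as np

def weighted_pixel_contrast(anchor, positives, negatives, alpha, tau=0.1):
    """Eq. (6) for one anchor embedding.

    anchor:    (D,)   embedding of pixel i.
    positives: (P, D) embeddings sharing the anchor's boundary pseudo-label.
    negatives: (N, D) embeddings of background pixels.
    alpha:     class-balancing weight |Y^-| / (|Y^-| + |Y^+|).
    tau:       temperature (assumed value, not from the paper).
    """
    pos = np.exp(positives @ anchor / tau)              # one term per positive
    neg_sum = np.exp(negatives @ anchor / tau).sum()    # shared by all positives
    per_positive = -alpha * np.log(pos / (pos + (1.0 - alpha) * neg_sum))
    return per_positive.mean()                          # 1/|P_i| * sum over positives
```

The factor $(1-\alpha)$ shrinks the contribution of background negatives in the denominator, which is how the loss limits the influence of the many unreliable background labels mentioned above.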

3.5 Overall Loss Function

The loss function for our semantic segmentation branch is derived from DeepLabv3 Chen et al. (2018), and we denote it as $\mathcal{L}_{\text{S}}$. The overall loss function of BAISeg can be formulated as follows:

$$\mathcal{L}_{\text{Overall}} = \alpha \times \mathcal{L}_{\text{P}} + \beta \times \mathcal{L}_{\text{B}} + \gamma \times \mathcal{L}_{\text{S}}, \tag{7}$$

where $\alpha$, $\beta$, and $\gamma$ represent the weights for the three loss components, and we set $\alpha = 0.3$, $\beta = 50$, and $\gamma = 1$.

Figure 4: Visualization results on the COCO 2017 dataset. Comparison with CIM Li et al. (2023c).

| Method | Backbone | Superv. | $AP$ | $AP_{50}$ | $AP_{75}$ |
|---|---|---|---|---|---|
| COCO val2017 | | | | | |
| Mask R-CNN He et al. (2017) | ResNet-101 | $\mathcal{F}$ | 35.4 | 57.3 | 37.5 |
| CIM Li et al. (2023c) | ResNet-50 | $\mathcal{I}$ | 11.9 | 22.8 | 11.1 |
| CIM Li et al. (2023c) | ResNet-50 | $\mathcal{I}$ | 17.0 | 29.4 | 17.0 |
| BESTIE Kim et al. (2022) | HRNet-W48 | $\mathcal{P}$ | 17.7 | 34.0 | 16.4 |
| AttnShift Liao et al. (2023) | ViT-S | $\mathcal{P}$ | 19.1 | 38.8 | 17.4 |
| Box2Mask Li et al. (2022b) | ResNet-50 | $\mathcal{B}$ | 32.2 | 54.4 | 32.8 |
| BAISeg (Semantic_gd) | HRNet-W48 | $\mathcal{P_{M}}$ | 15.8 | 25.8 | 15.9 |
| BAISeg (Semantic_gd) | HRNet-W48 | $\mathcal{P_{M}}$ | 19.1 | 33.4 | 19.0 |
| COCO Test-Dev | | | | | |
| Mask R-CNN He et al. (2017) | ResNet-101 | $\mathcal{F}$ | 35.7 | 58.0 | 37.8 |
| CIM Li et al. (2023c) | ResNet-50 | $\mathcal{I}$ | 17.0 | 29.7 | 17.0 |
| LIID Liu et al. (2020) | ResNet-101 | $\mathcal{I}$ | 16.0 | 27.1 | 16.5 |
| BESTIE Kim et al. (2022) | HRNet-W48 | $\mathcal{P}$ | 17.8 | 34.1 | 16.7 |
| AttnShift Liao et al. (2023) | ViT-S | $\mathcal{P}$ | 19.1 | 38.9 | 17.1 |
| BAISeg (Semantic_gd)⋆⋆ | HRNet-W48 | $\mathcal{P_{M}}$ | 13.5 | 22.8 | 13.8 |
| BAISeg (Semantic_gd) | HRNet-W48 | $\mathcal{P_{M}}$ | 19.0 | 33.6 | 19.0 |

Table 2: Comparison with state-of-the-art methods on the COCO dataset. † denotes additional training with Mask R-CNN to refine the prediction. ⋆ denotes inference using the ground truth mask as semantic segmentation. ⋆⋆ denotes inference using the result of Mask2Former Cheng et al. (2022a) as semantic segmentation. “Superv.” represents the training supervision ($\mathcal{F}$: instance-level mask, $\mathcal{I}$: image-level class label, $\mathcal{P}$: point, $\mathcal{B}$: bounding box, $\mathcal{P_M}$: pixel-level mask). BAISeg (Semantic_gd) means that BAISeg uses the ground truth mask as semantic segmentation.

3.6 Mask Extraction Pipeline

The IABD branch aims to extract class-agnostic instance masks from the instance boundary map. In simple terms, we use the well-known watershed algorithm to extract instance masks from the boundary map. However, the instance boundaries generated by the IABD branch suffer from three issues: (1) rough boundaries, (2) boundaries that are not fully closed, and (3) holes in the masks. Therefore, we propose a dedicated Mask Extraction Pipeline (MEP) to deal with these problems. The pipeline consists of three stages. In stage one, Non-Maximum Suppression (NMS) is applied to the boundary map to obtain thinner boundaries. In stage two, a label map is extracted from the instance boundaries by the Connected-Component Labeling (CCL) algorithm HE20091977_CCL. In stage three, the input image and label map are combined and processed by the watershed algorithm Vincent and Soille (1991) to derive the class-agnostic instance masks.
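The flow from boundary map to instance labels can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the NMS stage is skipped, the threshold of 0.5 is an assumed value, and the image-guided watershed of stage three is replaced by a plain breadth-first flood that lets each labeled region absorb the adjacent boundary pixels. The function names `label_regions` and `flood_boundaries` are hypothetical.

```python
from collections import deque

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity


def label_regions(boundary, thresh=0.5):
    """Stage two (sketch): label 4-connected regions of non-boundary pixels.

    `thresh` is an assumed cut-off on the boundary response.
    """
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if boundary[y][x] >= thresh or labels[y][x]:
                continue  # boundary pixel, or already labeled
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in NEIGHBORS:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and not labels[ny][nx]
                            and boundary[ny][nx] < thresh):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def flood_boundaries(labels):
    """Stage three (stand-in for watershed): grow labels into unlabeled pixels."""
    h, w = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    queue = deque((y, x) for y in range(h) for x in range(w) if out[y][x])
    while queue:
        cy, cx = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = cy + dy, cx + dx
            if 0 <= ny < h and 0 <= nx < w and not out[ny][nx]:
                out[ny][nx] = out[cy][cx]
                queue.append((ny, nx))
    return out


# Toy boundary map: a vertical boundary separates two regions.
boundary = [
    [0.0, 0.1, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.1, 0.0],
    [0.1, 0.0, 0.9, 0.0, 0.0],
]
labels, count = label_regions(boundary)
instances = flood_boundaries(labels)
```

In this toy case the two sides of the boundary receive different labels, and the boundary pixels themselves are absorbed by whichever region reaches them first. A real watershed would instead use the image content to decide where the regions meet.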

To fill the holes in the derived instance masks, we further apply a morphological closing operation. To compensate for boundaries that are not fully closed, the semantic map is used as a filter that keeps only the pixels within the semantic region. We refer to these additional operations as “refinement”.
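A minimal sketch of the two refinement operations follows, written in plain Python so that it stays self-contained. The 3×3 structuring element and the helper names (`dilate`, `erode`, `closing`, `semantic_filter`) are assumptions for illustration; the source does not state the kernel size or the exact implementation.

```python
def dilate(mask):
    """Binary dilation with a 3x3 window; pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]


def erode(mask):
    """Binary erosion with a 3x3 window; pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [[int(0 < y < h - 1 and 0 < x < w - 1
                 and all(mask[ny][nx]
                         for ny in range(y - 1, y + 2)
                         for nx in range(x - 1, x + 2)))
             for x in range(w)] for y in range(h)]


def closing(mask):
    """Closing = dilation followed by erosion, on a canvas padded by one pixel
    so that mask pixels touching the image border are preserved."""
    w = len(mask[0])
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in mask] + [[0] * (w + 2)]
    closed = erode(dilate(padded))
    return [row[1:-1] for row in closed[1:-1]]


def semantic_filter(instance_labels, semantic):
    """Keep an instance label only where the semantic map marks foreground."""
    return [[lab if sem else 0 for lab, sem in zip(lab_row, sem_row)]
            for lab_row, sem_row in zip(instance_labels, semantic)]


# A ring-shaped mask with a one-pixel hole in the middle.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = closing(ring)
```

Here `filled` has the central hole set to 1 while the outer background stays 0. The semantic filter is applied per pixel, so any instance pixels that leaked into the background are reset to 0.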

4 Experiments

4.1 Datasets and Evaluation Metrics

Following previous methods, we demonstrate the effectiveness of the proposed approach on Pascal VOC 2012 Everingham et al. (2010) and COCO Lin et al. (2014) datasets. The VOC 2012 dataset includes 10,582 images for training and 1,449 images for validation, comprising 20 object categories. The COCO dataset consists of 115K training, 5K validation, and 20K testing images with 80 object categories. We evaluate the performance using the mean Average Precision ($mAP$) with intersection-over-union (IoU) thresholds of 0.25, 0.5, 0.7, and 0.75 for VOC 2012 and Average Precision ($AP$) over IoU thresholds from 0.5 to 0.95 for COCO.
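The matching criterion behind these metrics can be sketched as follows. The snippet only computes the mask IoU and checks it against the two threshold sets. A full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve, which is omitted here. The step of 0.05 between COCO thresholds is the standard COCO convention rather than something stated in the text, and the names are hypothetical.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and g)
    union = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p or g)
    return inter / union if union else 0.0


VOC_THRESHOLDS = [0.25, 0.5, 0.7, 0.75]
# 0.50, 0.55, ..., 0.95 (assumed step of 0.05, as in the COCO protocol)
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

gt = [[1, 1, 1]]
pred = [[0, 1, 1]]
iou = mask_iou(pred, gt)  # 2 overlapping pixels out of 3 in the union

# A prediction counts as a true positive at a threshold if its IoU reaches it.
voc_hits = [t for t in VOC_THRESHOLDS if iou >= t]
coco_hits = [t for t in COCO_THRESHOLDS if iou >= t]
```

With an IoU of 2/3, the prediction is accepted at the VOC thresholds 0.25 and 0.5 but rejected at 0.7 and 0.75, which shows why the stricter thresholds are more sensitive to boundary quality.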

4.2 Implementation Details

We use the PyTorch 1.13 framework with CUDA 11.7, cuDNN 8, and NVIDIA A40 GPUs. We adopt HRNet48 Tian et al. (2021a) as our backbone network. The input image size for training is $416 \times 416$, and we keep the original resolution for evaluation. We train the network with a batch size of 16, the Adam optimizer Kingma and Ba (2017) with a learning rate of $5 \times 10^{-5}$, and polynomial learning rate scheduling Liu et al. (2015). The total number of training iterations is $6 \times 10^{4}$ for the VOC 2012 dataset and $48 \times 10^{4}$ for the COCO dataset. The training configuration of our Pixel-to-Pixel Contrast follows Wang et al. (2021). Following previous works Liu et al. (2022); Li et al. (2023c), we also generate pseudo labels from BAISeg for training Mask R-CNN.
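As an illustration of the polynomial schedule, the sketch below computes the learning rate at a given iteration from the stated base rate and iteration budget. The decay exponent is not given in the text, so `power=0.9` is an assumed, commonly used value, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power.

    `power` is an assumption; the source does not state the exponent.
    """
    return base_lr * (1.0 - step / max_steps) ** power


BASE_LR = 5e-5          # learning rate stated in the text
VOC_ITERS = 6 * 10**4   # training iterations for VOC 2012
COCO_ITERS = 48 * 10**4 # training iterations for COCO

lr_start = poly_lr(BASE_LR, 0, VOC_ITERS)
lr_half = poly_lr(BASE_LR, VOC_ITERS // 2, VOC_ITERS)
lr_end = poly_lr(BASE_LR, VOC_ITERS, VOC_ITERS)
```

The rate starts at the base value, decreases monotonically, and reaches zero at the final iteration.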

4.3 Comparison With State-of-the-Art

4.3.1 Results on PASCAL VOC 2012

To illustrate the capabilities of BAISeg, we compare it with other leading instance segmentation methods on the PASCAL VOC 2012 validation set, as shown in Table 1. Our method is evaluated without any special techniques or tricks. We evaluate BAISeg with semantic segmentation masks of varying quality; among these, BAISeg (Semantic_gd) performs best, achieving 55.32% $mAP_{25}$, 48.40% $mAP_{50}$, and 36.97% $mAP_{70}$, and outperforms methods that rely solely on image-level supervision. Compared to proposal-based, point-level, and box-level methods, BAISeg (Semantic_gd) also achieves competitive results. BAISeg (Semantic_gd)⋆† exceeds image-level, point-level, and proposal-based methods. The performance of BAISeg is also competitive with fully supervised and box-level methods.

4.3.2 Results on COCO 2017

In Table 2, we compare BAISeg with other leading instance segmentation methods on the COCO validation and Test-Dev sets. To ensure a fair comparison, none of the methods use additional training data. BAISeg (Semantic_gd)⋆† improves significantly over the best image-level methods in terms of $AP_{50}$, by 4.0% and 3.9% on the validation set and Test-Dev set, respectively. BAISeg also performs comparably to point-level methods. However, there is still a gap when compared to methods trained with stronger localization supervision, such as Box2Mask and Mask R-CNN. Our primary focus is on establishing a novel paradigm for WSIS, and quantitative improvement is regarded as a secondary goal.

4.4 Ablation Study

We conduct several ablation studies on the Pascal VOC 2012 dataset to evaluate the effectiveness of each component. In these studies, we use HRNet-W48 as the backbone and omit MRCNN refinement to save time.

4.4.1 Impact of CFM

| CFM | DMA | PPC | Refinement | $mAP_{50}$ | $mAP_{75}$ |
|---|---|---|---|---|---|
| × | × | × | × | 43.4 | 28.1 |
| ✓ | × | × | × | 45.3 | 31.3 |
| ✓ | ✓ | × | × | 47.0 | 32.1 |
| ✓ | ✓ | ✓ | × | 48.1 | 31.7 |
| ✓ | ✓ | ✓ | ✓ | 48.4 | 32.3 |

Table 3: Effect of the proposed methods: CFM, DMA, PPC, and Refinement.

| $R$ | GFLOPs (G) | Memory (M) | $mAP_{50}$ |
|---|---|---|---|
| 1 | 73.3 | 0.39 | 44.5 |
| 2 | 158.6 | 0.88 | 46.4 |
| 3 | 158.6 | 1.38 | 45.7 |
| 4 | 329.4 | 1.87 | 47.2 |

Table 4: Impact of different cascade numbers $R$ in CFM. The hidden layer dimension of CFM is set to 128.

| dim | GFLOPs (G) | Memory (M) | $mAP_{50}$ |
|---|---|---|---|
| 32 | 23.9 | 0.13 | 44.8 |
| 64 | 86.7 | 0.48 | 46.9 |
| 128 | 329.4 | 1.87 | 47.2 |
| 256 | 1282.5 | 7.37 | 46.7 |

Table 5: Impact of different hidden layer dimensions of CFM. The number of CFMs is set to $R = 4$.

| Backbone | dim | GFLOPs (G) | Memory (M) | $mAP_{50}$ |
|---|---|---|---|---|
| ResNet50 | 32 | 55.9 | 26.1 | 38.2 |
| ResNet50 | 64 | 62.3 | 26.8 | 48.4 |
| ResNet50 | 128 | 75.1 | 28.2 | 48.4 |
| ResNet101 | 32 | 68.8 | 45.1 | 39.1 |
| HRNet-W34 | 34 | 63.7 | 30.5 | 44.3 |
| HRNet-W48 | 48 | 92.8 | 65.6 | 48.4 |
| HRNetV2-W48 | 48 | 119.5 | 69.8 | 48.0 |

Table 6: Effect of the backbone on our WSIS performance. denotes the use of multi-scale backbone features.

| Temperature | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|---|---|---|---|---|---|---|---|---|
| $mAP_{50}$ | 47.4 | 48.4 | 47.4 | 47.0 | 47.6 | 47.4 | 47.7 | 47.6 |

Table 7: Impact of Different Temperatures on Losses in the Pixel Contrast Module

The design of CFM helps decouple the class-agnostic boundary detection branch from the semantic segmentation branch, while progressively extracting and refining boundary features across multiple stages. As shown in Table 3, CFM leads to an improvement of 1.9% in $mAP_{50}$ and 3.2% in $mAP_{75}$. We also investigate the impact of the number and dimension of cascaded structures in CFM on GFLOPs, memory, and $mAP_{50}$. Table 4 shows that the highest $mAP_{50}$ is achieved when $R$ equals 4, while Table 5 reveals that the best results are obtained when the dimension is set to 128. With these two hyper-parameters set, CFM strikes a balance between performance and computational efficiency.

4.4.2 Impact of DMA

DMA is designed to capture weakly responsive instance boundary information from deeply fused features. As shown in Table 3, adding DMA on top of CFM leads to an improvement of 1.7% in $mAP_{50}$ and 0.8% in $mAP_{75}$. As a submodule of CFM, DMA shares the optimal configuration with CFM.

4.4.3 Impact of PPC

The goal of Pixel-to-Pixel Contrast (PPC) is to enhance the perceptual capability of the IABD branch, with its core principle focusing on attracting similar pixels while repelling dissimilar ones. As shown in Table 3, adding PPC leads to an improvement of 1.1% in $mAP_{50}$ and a decrease of 0.4% in $mAP_{75}$. This decrease in $mAP_{75}$ is attributed to the use of semantic contour pseudo-labels to sample pixels in PPC, which results in rough predicted boundaries. We also evaluate the impact of different temperatures on the weighted Pixel-to-Pixel Contrast loss. As shown in Table 7, the best performance is achieved when the temperature is set to $0.2$.
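To make the role of the temperature concrete, the following sketch evaluates a generic InfoNCE-style pixel contrast loss for a single anchor embedding. This is the general form used in pixel-wise contrastive learning and is given only as an assumption about the shape of the loss. It does not reproduce the weighting or the pixel sampling strategy used by BAISeg, and `pixel_contrast_loss` is a hypothetical name.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def pixel_contrast_loss(anchor, positives, negatives, temperature=0.2):
    """InfoNCE-style loss for one anchor pixel embedding.

    For each positive p: -log( exp(a.p / t) / (exp(a.p / t) + sum_n exp(a.n / t)) ),
    averaged over all positives. Embeddings are L2-normalized first.
    """
    a = _normalize(anchor)
    neg_sum = sum(math.exp(_dot(a, _normalize(n)) / temperature) for n in negatives)
    total = 0.0
    for p in positives:
        pos = math.exp(_dot(a, _normalize(p)) / temperature)
        total += -math.log(pos / (pos + neg_sum))
    return total / len(positives)


anchor = [1.0, 0.0]
same_side = [[0.9, 0.1]]                  # embedding close to the anchor
other_side = [[-1.0, 0.0], [0.0, 1.0]]    # embeddings far from the anchor

loss_good = pixel_contrast_loss(anchor, same_side, other_side)
loss_bad = pixel_contrast_loss(anchor, other_side[:1], same_side)
loss_warm = pixel_contrast_loss(anchor, same_side, other_side, temperature=0.8)
```

The loss is small when positives lie close to the anchor and negatives lie far away, and large in the opposite case. A lower temperature sharpens the softmax, so a well-separated configuration is penalized less at 0.2 than at 0.8.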

4.4.4 Impact of Refinement

The Mask Extraction Pipeline is designed to extract class-agnostic instance masks from instance boundaries. In particular, we observe that some instance masks generated by our pipeline exhibit issues such as holes and broken connectivity. Additionally, instance boundaries that are not fully closed allow instance pixels to leak into the background. The refinement procedure alleviates these issues. As shown in the last row of Table 3, the refinement results in a 0.3% and 0.6% improvement in terms of $mAP_{50}$ and $mAP_{75}$, respectively.
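The per-component gains quoted in Sections 4.4.1 to 4.4.4 can be re-derived from consecutive rows of Table 3. The short sketch below does this arithmetic; the row labels are shorthand introduced here for readability.

```python
# (configuration, mAP50, mAP75) taken from Table 3, top to bottom.
TABLE_3 = [
    ("baseline", 43.4, 28.1),
    ("+CFM", 45.3, 31.3),
    ("+DMA", 47.0, 32.1),
    ("+PPC", 48.1, 31.7),
    ("+Refinement", 48.4, 32.3),
]


def incremental_gains(rows):
    """Difference of each row to the row above it, rounded to one decimal."""
    return {
        cur[0]: (round(cur[1] - prev[1], 1), round(cur[2] - prev[2], 1))
        for prev, cur in zip(rows, rows[1:])
    }


gains = incremental_gains(TABLE_3)
for name, (d50, d75) in gains.items():
    print(f"{name}: mAP50 {d50:+.1f}, mAP75 {d75:+.1f}")
```

Only PPC shows a negative entry, in $mAP_{75}$, which matches the discussion of rough boundaries in Section 4.4.3.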

4.4.5 Impact of Backbone

Table 6 illustrates the impact of backbone selection on BAISeg performance. We employ HRNet-W48 Tian et al. (2021a) for BAISeg and achieve the best performance of 48.4 $mAP_{50}$. The ResNet50 rows of Table 6 show that the performance of ResNet50 improves with increasing dimension, reaching saturation at a dimension of 128. Given that the ResNet architectures require multi-scale fusion to match the performance of HRNet-W48, we consider HRNet-W48 more suitable for deployment.

5 Conclusion

This paper introduces BAISeg, a novel WSIS method that bridges the gap between semantic segmentation and instance segmentation via instance boundary detection. To decouple the two parallel branches, we propose CFM. Within CFM, we introduce DMA to identify weakly responsive instance boundaries. During training, PPC is employed to address issues such as the non-closedness of instance boundaries. Our method demonstrates competitive results on both the VOC 2012 and COCO datasets. Future work will focus on applying our approach to more challenging computer vision tasks, such as panoptic segmentation Li et al. (2022a).

References

  • Acuna et al. [2019] David Acuna, Amlan Kar, and Sanja Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations, 2019.
  • Ahn et al. [2019] Jiwoon Ahn, Sunghyun Cho, and Suha Kwak. Weakly supervised learning of instance segmentation with inter-pixel relations. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • Cheng et al. [2022a] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022.
  • Cheng et al. [2022b] Bowen Cheng, Omkar Parkhi, and Alexander Kirillov. Pointly-supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • Chopra and Rao [1993] Sunil Chopra and M. R. Rao. The partition problem. Mathematical Programming, 59:87–115, 1993.
  • Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010.
  • Ge et al. [2022] Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Laurent Itti, and Vibhav Vineet. Em-paste: Em-guided cut-paste with dall-e augmentation for image-level weakly supervised instance segmentation, 2022.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.
  • He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning, 2020.
  • Kim et al. [2022] Beomyoung Kim, Youngjoon Yoo, Chae Eun Rhee, and Junmo Kim. Beyond semantic to instance segmentation: Weakly-supervised instance segmentation via semantic knowledge transfer and self-refinement. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • Kirillov et al. [2016] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. Instancecut: from edges to instances with multicut, 2016.
  • Lan et al. [2021] Shiyi Lan, Zhiding Yu, Christopher Choy, Subhashree Radhakrishnan, Guilin Liu, Yuke Zhu, Larry S Davis, and Anima Anandkumar. Discobox: Weakly supervised instance segmentation and semantic correspondence from box supervision. In IEEE International Conference on Computer Vision, 2021.
  • Li et al. [2022a] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022.
  • Li et al. [2022b] Wentong Li, Wenyu Liu, Jianke Zhu, Miaomiao Cui, Risheng Yu, Xiansheng Hua, and Lei Zhang. Box2mask: Box-supervised instance segmentation via level-set evolution. arXiv preprint arXiv:2212.01579, 2022.
  • Li et al. [2023a] Ruihuang Li, Chenhang He, Yabin Zhang, Shuai Li, Liyi Chen, and Lei Zhang. Sim: Semantic-aware instance mask generation for box-supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  • Li et al. [2023b] Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, and Lei Zhang. Point2mask: Point-supervised panoptic segmentation via optimal transport, 2023.
  • Li et al. [2023c] Zecheng Li, Zening Zeng, Yuqi Liang, and Jin-Gang Yu. Complete instances mining for weakly supervised instance segmentation. In International Joint Conference on Artificial Intelligence, 2023.
  • Liao et al. [2023] Mingxiang Liao, Zonghao Guo, Yuze Wang, Peng Yuan, Bailan Feng, and Fang Wan. Attentionshift: Iteratively estimated part-based attention map for pointly supervised instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • Liu et al. [2015] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better, 2015.
  • Liu et al. [2020] Yun Liu, Yu-Huan Wu, Pei-Song Wen, Yu-Jun Shi, Yu Qiu, and Ming-Ming Cheng. Leveraging instance-, image-and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • Liu et al. [2022] Yun Liu, Yu-Huan Wu, Peisong Wen, Yujun Shi, Yu Qiu, and Ming-Ming Cheng. Leveraging instance-, image- and dataset-level information for weakly supervised instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1415–1428, 2022.
  • Pont-Tuset et al. [2017] Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, January 2017.
  • Sun et al. [2019] Ke Sun, Yang Zhao, Borui Jiang, Tianheng Cheng, Bin Xiao, Dong Liu, Yadong Mu, Xinggang Wang, Wenyu Liu, and Jingdong Wang. High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514, 2019.
  • Tian et al. [2021a] Xin Tian, Ke Xu, Xin Yang, Baocai Yin, and Rynson W. H. Lau. Learning to detect instance-level salient objects using complementary image labels, 2021.
  • Tian et al. [2021b] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • Vincent and Soille [1991] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
  • Wang et al. [2021] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation, 2021.
  • Xie and Tu [2015] Saining Xie and Zhuowen Tu. Holistically-nested edge detection, 2015.
  • Yuan et al. [2021] Yuhui Yuan, Xiaokang Chen, Xilin Chen, and Jingdong Wang. Segmentation transformer: Object-contextual representations for semantic segmentation, 2021.
  • Zhang et al. [2019] Bingfeng Zhang, Jimin Xiao, Yunchao Wei, Mingjie Sun, and Kaizhu Huang. Reliability does matter: An end-to-end weakly supervised semantic segmentation approach, 2019.
  • Zhao et al. [2021] Xiangyun Zhao, Raviteja Vemulapalli, Philip Mansfield, Boqing Gong, Bradley Green, Lior Shapira, and Ying Wu. Contrastive learning for label-efficient semantic segmentation, 2021.
  • Zhu et al. [2022] Chenming Zhu, Xuanye Zhang, Yanran Li, Liangdong Qiu, Kai Han, and Xiaoguang Han. Sharpcontour: A contour-based boundary refinement approach for efficient and accurate instance segmentation, 2022.