
1 Technion – Israel Institute of Technology, [email protected]
2 Bosch Center for AI, {david.vainshtein,dotan.dicastro}@il.bosch.com
3 Ben-Gurion University of the Negev, [email protected]

Robot Instance Segmentation with Few Annotations for Grasping

Moshe Kimhi (ORCID 0009-0000-7645-7339) 1,2    David Vainshtein (ORCID 0000-0003-2015-910X) 2    Chaim Baskin (ORCID 0000-0003-4341-5639) 1,3    Dotan Di Castro (ORCID 0009-0001-0900-3932) 2
Equal contribution
Abstract

The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and leverage visual consistency despite temporal gaps without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench [34] mix-object-tote and OCID [37], where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an $\text{AP}_{50}$ of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely low annotation, achieving an $\text{AP}_{50}$ score of 84.89 with just 1% of annotated data compared to 72 presented in [34] on the fully annotated counterpart. Code is available at https://github.com/mkimhi/RISE.

1 Introduction

Acquiring accurate instance segmentation masks requires training a model on vast amounts of data with high-quality pixel-level annotations. While collecting raw sensory data (images) is relatively easy, annotating object instance masks down to individual pixels becomes prohibitively expensive when scaling up perception tasks. As a result, models trained on limited annotated data inevitably face challenges when deployed in the real world due to domain variation and evolving environments. This problem is central in robotics, where robots rely on spatial perception extracted from sensory inputs.

To use large amounts of unlabeled data, Semi-Supervised Learning (SSL) assumes that only a portion of the data is labeled: either a subset of observed scenes or some objects within each scene. The model then uses its own predictions as pseudo-labels to extract learning signals from the remaining unlabeled data [36, 48, 52, 39, 50]. However, a model attempting to learn from its own noisy labels early in training may stagnate rather than generalize.

Figure 1: Method: Pseudo-sequence generation from a single unlabeled image. The input is weakly augmented to produce the “before” image $x_1$ and augmented again to yield the “after” image $x_2$. To emulate scene interaction, objects are drawn from the object memory bank, transformed, and inserted into the “after” frame. The segmentation model’s task is to simultaneously associate objects that persist between frames (subject to occlusion), maintain consistency of object instance embedding, and correctly predict the ground-truth mask of the added objects.

Looking beyond spatial cues of still images, video sequences contain temporal information that a model can exploit to enforce consistency across frames and improve generalization. Recent advancements focusing on Learning Through Interaction (LTI) highlight the significance of providing the model with temporal perception. LTI enables the model to peer into the underlying dynamics of its domain by observing actions and their consequences [21, 8, 38]. The leading approaches entail observing a scene that undergoes various changes, such as objects being placed or extracted. By constructing the data in the form of “before” and “after” sequences [43, 44, 51, 24, 27], localized changes in illumination, deformation, and articulation of objects allow the model to refine its interpretation of the environment. The leading LTI techniques either prescribe multi-stage training that pre-trains on specialized datasets [27] or restrict input observations to strictly gradual changes at small time intervals [2]. Interestingly, leading methods for Video Instance Segmentation regularly resolve long image sequences depicting instances popping in and out of view [43, 44, 51, 16]. These incorporate a reassociation loss to overcome changes of occlusion, instance pose, and appearance. However, learning from videos relies either on significant investment in manual annotation of every object in every video frame or on video frames occurring at sufficiently small time intervals. Our main insight is that although each paradigm compensates for the weakness of the other (SSL lessens the annotation effort while LTI leverages temporal information), naively combining the two amplifies their drawbacks by reinforcing noisy labels across entire sequences.
In this work, we propose a solution in the form of a novel framework that incorporates the learning paradigms of SSL and LTI to enhance performance in the few-annotations scenario, in which only a tiny fraction of the dataset is annotated (and the rest is unlabeled). Our method simultaneously addresses the challenges of LTI and SSL. We eliminate the need for specialized datasets required for LTI by using pseudo-sequences generated from still images to mimic scene interaction. We also overcome the main obstacles to SSL by preventing noisy self-predictions from obscuring the learning signal through coupling prediction heads, thus stabilizing predictions early in training. Our framework is model-agnostic, complementing existing (and future) segmentation models with temporal perception through end-to-end training. Additionally, we propose an automated pseudo-label criterion that discards low-quality predictions.

The resulting framework can be considered the first to employ self-supervised learning through interaction, achieving better performance than each paradigm individually. We set a new state-of-the-art on the ARMBench [34] benchmark and OCID [37] (RGB only). Notably, our method trained on 1% of annotated data surpasses the performance of the well-established Deformable DETR [53] architecture, even when the latter is trained on $10\times$ additional annotated data (improving $+16.86$ AP using Swin-L Transformer as feature extractor).

2 Background and Related Work

Of the various approaches to instance segmentation, we are interested in those that excel without full supervision [21, 8]. This section provides an overview of relevant works on partial supervision and learning from sequences.

Partial supervision methods use the few annotated examples available (if any) and maintain consistent predictions for similar objects in the scene [57]. In recent years, most efforts have focused on contrastive learning, which extracts embeddings from object instances and aims to bring same-class embeddings closer while pushing those of other classes further apart [4, 36]. That said, progress in object classification and detection does not readily carry over to image segmentation, where the effectiveness of self-supervision lags behind full supervision in challenging domains of cluttered objects with many occlusions [54]. Unsurprisingly, these domains are also more complicated for humans to annotate.

Scene modulation is a concept that aims to extract additional learning signal by familiarizing the model with objects that undergo gradual alterations within a scene [41, 47, 24, 42, 7], where objects are viewed in many configurations as well as under different clutter and lighting conditions. This offers a substantial advantage in detecting and identifying objects that may deform or exhibit variations, thereby enhancing the robustness of the segmentation. Note, however, that assembling large dedicated datasets of objects is resource-intensive and challenging to apply effectively to new scenes featuring previously unseen objects. A recent work [47] achieved significant improvement by incorporating simulated data before transferring to real-world scenes [15, 24]. The main drawback of using synthetic data is the high cost of creating photorealistic rendering that accurately captures the physical properties of every object in the scene. Oftentimes this results in idiosyncrasies that are picked up by the model and become a source of error when encountering real-world data.

Frame sequences offer additional information along the time dimension. As with scene modulation, the model learns to recognize and identify related instances throughout a series of images [2]. Recent advancements in video instance segmentation (VIS) methods, exemplified by SeqFormer [43] and IDOL [44], leverage sequential consistency of instances for online object segmentation and tracking. They employ contrastive loss to ensure that instance representations are distinguishable from other instances in the same frame and over previous frames. In CTVIS [51] the model also taps into future frames.

Learning through interaction pushes the notion of sequences even further by specializing in image sequences that depict predefined and controlled scene manipulation. Consecutive frames in these meticulously assembled datasets exhibit large temporal gaps, unlike video data, and changes are usually confined to localized actions on few object instances [27]. This locality constraint persists through frame sequences, allowing LTI approaches to infer which instances have actually changed and which are merely affected by variations in lighting, occlusion and deformation, as a result of the action performed. The model quickly learns to segment an object that is added or removed, using a few hundred labeled image pairs. Since assembling such specialized datasets requires significant effort, the next stage in training artificially inserts cropped instances from high-confidence mask predictions into unlabeled still-images to emulate interactions. In STOW [24], the model is additionally trained on synthesized virtual scenes and then evaluated on real-world data.

The above advancements present an interesting question: In real-world applications where the model inevitably encounters a changing domain, is it possible to continuously learn (post deployment) without supervision by leveraging the temporal information of video sequences using the causal awareness of LTI?
Importantly, can this be achieved without investing in a proprietary dataset or reliance on a specific instance segmentation model?

3 RISE

We introduce a novel framework called Robot Instance Segmentation for Few-Annotation Grasping (RISE) that unifies learning from temporal signals (through interactions) and spatial signals (through self-supervision). RISE is trained end-to-end on still images, and enables self-supervision to learn from temporal consistency when scene objects are moved, added or removed. Because of this, RISE does not require a meticulously compiled dataset of before and after image pairs of scene interaction, nor does it require a large dataset of labeled instances; it is therefore more readily capable of handling domain variations that commonly occur in the real world.

Figure 2: RISE framework (from left to right). Given an unlabeled image $x$ and a bank of known instances, we perform (1) temporal (blue) and (2) spatial (purple) augmentations. Temporal augmentation adds weak augmentation and inserts $K$ instances from the bank to create $x_1$ (the “before” frame). Another round of weak augmentations, combined with adding/moving/removing a subset of the $K$ added instances, produces $x_2$ (the “after” frame). Spatial augmentation adds strong augmentations to create $x_3$. The three images are batched and fed into the model, where the backbone extracts features that are then encoded into instance embeddings. The instance embeddings from $x_1$ and $x_2$ are used to compute $\mathcal{L}_{embed}$ (Eq. 4). The embeddings from $x_1$ serve as pseudo-labels against the embeddings from $x_3$ in the self-supervised loss $\mathcal{L}_{\mathrm{u}}$ (Eq. 5).

The framework accommodates both supervised and semi-supervised data, comprising an instance segmentation model (Sec. 3.1) that is enhanced to extract learning signals from both instance-association and consistency losses (Sec. 3.2).

3.1 Instance Segmentation

Object instance segmentation begins with the input image $x$, which first undergoes feature extraction by a backbone. The features are then fed into an instance-level embedding encoder that outputs 300 tokens that serve the decoder, which emits instance embeddings $z_i$ into the prediction heads for the class, bounding box, and mask of object instance $i$. In this work we evaluate various established backbones as options for the feature extractor: ResNet50, ResNet101 [14], and Swin-L [28]. As the embedding decoder we chose Deformable DETR [53], a strong spatial decoder, for its ability to learn object queries as features. The prediction heads for class labels and box coordinates are feed-forward networks (FFNs), whereas the mask prediction head is a Feature Pyramid Network (FPN) [25] that uses multi-scale features from the decoder’s last layers, followed by an FFN whose output mask is scaled up to match the original image size.

When the input is also accompanied by labels $\mathbf{y}$, the supervised component of the loss comprises a class label loss $\mathcal{L}_{cls}$; a box loss $\mathcal{L}_{box}$ that combines the $L_1$ loss and the generalized Intersection over Union (gIoU) loss [35]; and a mask loss $\mathcal{L}_{mask}$, the sum of the Dice loss [33] and the Focal loss [26]:

$$\mathcal{L}_{\mathrm{s}} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{box} + \lambda_2 \mathcal{L}_{mask}, \tag{1}$$

where $\lambda_1$ and $\lambda_2$ are the loss coefficients.

Recent advancements in object detection incorporated optimal transport (OT) to address the optimal assignment between predictions and ground truth, as proposed by Ge et al. [9] and YOLOX [10]. Therefore, we compute the pairwise cost between predictions and ground truth instances, determining the optimal assignment for the top $k$ predictions associated with each instance.

While operating on labeled data, object instances that are successfully segmented by the model are stored in a memory bank [11]. This object bank will be used in the self-supervised phase, during which labeled instances are randomly selected and augmented to emulate scene interaction.

3.2 Learning Through Interaction

Observing interactions shares commonality with Object Tracking, which goes beyond traditional object detection. It leverages discriminative representation of instances across frames and of different instances belonging to the same class. The resulting representation is more robust to occlusion and identity switches, as demonstrated in [22, 45].

Given an unlabeled input frame $x$ containing an unknown number of object instances, we introduce a new augmentation strategy to create a pair of pre- and post-interaction frames. The first, pre-interaction frame $x_1 = \psi_1(x)$ is an augmentation $\psi_1$ of $x$ where we stochastically insert $K$ objects from a bank of known instances. Each of the $K$ objects is also individually augmented (e.g., scale, position, rotation, flip, color). The second, post-interaction frame $x_2 = \psi_2(x_1)$ is an augmentation $\psi_2$ of $x_1$ where we also remove several of the objects added to $x_1$ or insert a few more objects from the instance bank (with augmentations). Note that both $x_1$ and $x_2$ are spatially augmented with rotation, crop, and scale to convey a sense of motion to the observer (inspired by SeqFormer [43]).
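The bookkeeping behind such a frame pair can be sketched as follows. This is a minimal illustration that tracks only instance identifiers, not pixels; the function name `make_pseudo_sequence`, its parameters, and the choice to draw newly added objects from instances not already present in the “before” frame are assumptions made for the example rather than details stated in the text.

```python
import random

def make_pseudo_sequence(bank_ids, k, n_remove, n_add, seed=0):
    """Return the bank instances present in the "before" and "after" frames.

    Hypothetical helper: it only tracks which bank instances are pasted into
    each frame; image augmentation and pasting are out of scope here.
    """
    rng = random.Random(seed)
    before = rng.sample(bank_ids, k)            # K objects inserted into x1
    removed = rng.sample(before, n_remove)      # subset taken out when building x2
    unused = [b for b in bank_ids if b not in before]
    added = rng.sample(unused, n_add)           # extra objects inserted into x2 (assumed distinct)
    after = [b for b in before if b not in removed] + added
    return before, after

bank = list(range(10))
before, after = make_pseudo_sequence(bank, k=4, n_remove=1, n_add=2, seed=3)
persisting = set(before) & set(after)           # instances the model should re-associate
```

Instances in `persisting` are the ones for which a cross-frame association is expected, while the added instances carry known masks from the bank.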
Augmentation Strategy. Inserting new objects into a dense scene may lead to significant occlusions and even conceal the objects we intend to learn. Therefore, we devise a strategy that randomizes labeled objects from the instance bank and distributes them preferentially around the periphery of the frame:

$$(u, v) = Beta(\alpha, \beta) \cdot [w, h] \tag{2}$$

where $(u, v)$ is the top-left corner where the object is inserted, drawn from distribution $Beta(\alpha, \beta) \in \mathbb{R}^2$, and $w$, $h$ are the feasible horizontal and vertical regions that ensure that the object is contained within the frame (see Appendix A.A). The choice of $Beta$ and its parameters reduces the likelihood of objects inserted near the center, where they might obstruct unlabeled objects. Another safeguard prevents inserting an object if it would overlap with any of the previously inserted objects by more than 85%.
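A minimal sketch of this placement rule, using only the Python standard library, is given below. The values $\alpha = \beta = 0.5$ (a U-shaped Beta density that favors the frame borders), the independent draws for $u$ and $v$, and the definition of overlap as the fraction of the new object's box covered by an earlier box are all assumptions for illustration; the actual parameters are given in Appendix A.A.

```python
import random

def sample_top_left(frame_w, frame_h, obj_w, obj_h, alpha=0.5, beta=0.5, rng=random):
    """Draw a top-left corner (u, v) in the spirit of Eq. 2."""
    w = frame_w - obj_w                      # feasible horizontal range
    h = frame_h - obj_h                      # feasible vertical range
    if w < 0 or h < 0:
        raise ValueError("object is larger than the frame")
    # alpha = beta < 1 is an assumed setting that pushes mass toward 0 and 1.
    u = rng.betavariate(alpha, beta) * w
    v = rng.betavariate(alpha, beta) * h
    return u, v

def overlap_fraction(box, other):
    """Fraction of `box` area covered by `other`; boxes are (u, v, width, height)."""
    ax, ay, aw, ah = box
    bx, by, bw, bh = other
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / (aw * ah)

def place_objects(frame_w, frame_h, sizes, max_overlap=0.85, seed=0):
    """Place objects one by one, skipping any that overlap an earlier one too much."""
    rng = random.Random(seed)
    placed = []
    for obj_w, obj_h in sizes:
        u, v = sample_top_left(frame_w, frame_h, obj_w, obj_h, rng=rng)
        box = (u, v, obj_w, obj_h)
        if all(overlap_fraction(box, p) <= max_overlap for p in placed):
            placed.append(box)
    return placed

boxes = place_objects(640, 480, [(100, 80), (50, 50), (200, 120)], seed=1)
```

Rejected objects are simply skipped in this sketch; resampling a new position instead would be an equally plausible design.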
Association Loss. The resulting frames $x_1$, $x_2$ contain a total of $N$ and $M$ object instances, respectively. Importantly, the objects’ small projective and illumination transformations compel the model to learn robust representations that maintain consistency for occurrences of the same instance in a changing scene (illustrated in Fig. 2). Each embedding $i \in N$ extracted from the first frame $x_1$ is matched against every embedding $j \in M$ in the second frame $x_2$, forming the association score $f(i,j)$ between instance $i$ and instance $j$:

$$f(i,j) = \frac{1}{2}\left[\frac{\exp(z_j^T \cdot z_i)}{\sum\limits_{k=1}^{M} \exp(z_k^T \cdot z_i)} + \frac{\exp(z_j^T \cdot z_i)}{\sum\limits_{k=1}^{N} \exp(z_j^T \cdot z_k)}\right] \tag{3}$$

We consider the embedding of $\hat{j} = \mathrm{arg\,max}\; f(i,j)$ as a positive example for the given instance $i$ if $f(i,\hat{j}) > 0.5$, otherwise it is considered a negative example. We employ an embedding contrastive loss [3] to learn object representation from the observed interaction frames:

$$\mathcal{L}_{embed} = -\log \frac{\exp(z_i \cdot z_j^{+})}{\exp(z_i \cdot z_j^{+}) + \sum_{z_j^{-}} \exp(z_i \cdot z_j^{-})} \tag{4}$$

where $z_i$ is the embedding of instance $i$ in the first frame, $z_j^{+}$ is the embedding of the instance $j$ in the second frame that ideally represents the same instance, and $z_j^{-}$ are the embeddings of the remaining instances (negative views). The loss $\mathcal{L}_{embed}$ pulls same-instance embeddings closer together while pushing apart representations of different instances.
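To make Eqs. 3 and 4 concrete, the following pure-Python sketch computes the association scores and the resulting contrastive loss for two small sets of embeddings. Skipping instances that have no positive match and averaging the per-instance losses are assumptions of this example; the text does not specify how the per-instance terms are aggregated.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def association_scores(z1, z2):
    """f[i][j] of Eq. 3 for embeddings z1 (frame x1, N items) and z2 (frame x2, M items)."""
    e = [[math.exp(dot(zj, zi)) for zj in z2] for zi in z1]
    row_sum = [sum(row) for row in e]                      # sum over k in frame x2, fixed i
    col_sum = [sum(e[i][j] for i in range(len(z1)))        # sum over k in frame x1, fixed j
               for j in range(len(z2))]
    return [[0.5 * (e[i][j] / row_sum[i] + e[i][j] / col_sum[j])
             for j in range(len(z2))] for i in range(len(z1))]

def embed_loss(z1, z2, threshold=0.5):
    """Mean of Eq. 4 over instances of x1 that have a positive match in x2."""
    f = association_scores(z1, z2)
    losses = []
    for i, zi in enumerate(z1):
        j_hat = max(range(len(z2)), key=lambda j: f[i][j])
        if f[i][j_hat] <= threshold:
            continue                                       # assumed: no positive, no loss term
        pos = math.exp(dot(zi, z2[j_hat]))
        neg = sum(math.exp(dot(zi, z2[j])) for j in range(len(z2)) if j != j_hat)
        losses.append(-math.log(pos / (pos + neg)))
    return sum(losses) / len(losses) if losses else 0.0

# Two instances whose order is swapped between the frames.
z_before = [[4.0, 0.0], [0.0, 4.0]]
z_after = [[0.0, 4.0], [4.0, 0.0]]
scores = association_scores(z_before, z_after)
loss = embed_loss(z_before, z_after)
```

With well-separated embeddings, as in this toy input, each instance is matched to its swapped counterpart and the loss is close to zero.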

3.3 Self-Supervision

To better leverage spatial information in unlabeled data, we employ the segmentation model (Sec. 3.1) toward Semi-Supervised Learning (SSL). Inspired by [36], we include a consistency regularization loss and extend it to accept unlabeled images alongside labeled objects inserted from the instance bank.

Recall that $x_1 = \psi_1(x)$ is a weak augmentation of the unlabeled input image $x$. In this context, we denote $x_w = x_1$. We apply another round of weak augmentations to $x_w$, followed by a strong augmentation $\phi$ to produce $x_s = x_3 = \phi(x_1)$. The strong augmentations comprise color jitter, Planckian jitter [55], Gaussian blur, and gray-scale, applied via RandAugment [6]. We feed both $x_w$ and $x_s$ into the model. Class labels, bounding boxes, and segmentation masks predicted for the weakly augmented input $x_w$ are treated as pseudo-label targets (in the absence of ground truth) that are compared against the model’s prediction on $x_s$. The unsupervised consistency regularization loss is:

$$\mathcal{L}_{\mathrm{u}} = \hat{\mathcal{L}}_{cls} + \lambda_1 \hat{\mathcal{L}}_{box} + \lambda_2 \hat{\mathcal{L}}_{mask} \tag{5}$$

It is similar to the supervised loss $\mathcal{L}_{\mathrm{s}}$ (Eq. 1), with the distinction that pseudo-labels are used in place of ground-truth labels. Gradients are not computed during the forward pass of $x_w$ (as illustrated in Fig. 2), since its predictions stand in for the ground truth. We introduce the following refinements to stabilize the model during self-supervised training.
Refined Consistency Learning. It is common practice to filter out pseudo-labels with low prediction scores in order to reduce the model’s exposure to errors during self-supervised training. The filters are often thresholds or quantiles that are either fixed, dynamic, or scheduled [40, 18]. Thresholds, by nature, are more restrictive, discarding all predictions below their stated value. However, during the early stages of self-supervision, the model may emit most of its predictions slightly below the threshold, resulting in very few labels contributing towards learning. On the other hand, quantiles ignore the scores entirely and allow any prediction, provided that its score meets the rank requirement of the quantile. Because most models output a fixed number of predictions to accommodate crowded scenes (regularly exceeding 300 predictions), a quantile may become too lenient and include low-score predictions of poor quality, potentially degrading the model’s performance as training progresses.

In Appendix A.D, we demonstrate that early in training, setting the quantile too low lets in more predictions of low-quality signals, interfering with the model. Alternatively, setting the bar (too) high [36] risks missing out on meaningful supervision signals.

To reconcile the limitations of both thresholds and quantiles, we propose a cascade approach. First, a more relaxed class threshold $\gamma_t^{cls}$ removes instances whose class scores $\hat{c}_i \in \hat{\mathbf{c}}$ are deemed unusable, followed by a quantile selection $Q$ [40] of the leading predictions. The resulting class pseudo-labels $\hat{\mathbf{y}}$ are given by:

$$\hat{\mathbf{y}} = Q(\hat{\mathbf{c}} > \gamma_t^{cls};\; p_t). \tag{6}$$

The threshold $\gamma_t^{cls}$ discards instances with class scores $\hat{c}_i$ below it and tightens over time (training steps $t$). Conversely, the quantile $Q(p_t)$ loosens over time: its probability $p_t = 0.995 \cdot (1 - t/T)$ decays over subsequent training steps $t$, with $T$ denoting the total number of training steps. As a result, the quantile allows more predictions into the model as training progresses. This strategy can mitigate incorrect model beliefs and reduce confirmation biases. We evaluate this strategy quantitatively and demonstrate its advantage over thresholds and quantiles in Tab. 4, with additional details in Appendix A.C. We recognize that exploring different quantile strategies may further improve self-supervision and set it aside for future work.
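The cascade of Eq. 6 can be sketched as a two-stage filter over the class scores of one image. The sketch assumes that the quantile stage keeps the top $(1 - p_t)$ fraction of the predictions that survive the threshold, with at least one kept; this interpretation, the function names, and the fixed threshold value used in the example are illustrative assumptions, and the schedule for $\gamma_t^{cls}$ is left to the caller.

```python
import math

def quantile_probability(t, total_steps, p0=0.995):
    """p_t = 0.995 * (1 - t / T), as stated in the text."""
    return p0 * (1.0 - t / total_steps)

def cascade_filter(scores, class_threshold, t, total_steps):
    """Return indices of predictions kept as pseudo-labels at training step t."""
    # Stage 1: relaxed class threshold removes unusable predictions.
    survivors = [(i, s) for i, s in enumerate(scores) if s > class_threshold]
    if not survivors:
        return []
    # Stage 2: quantile selection over the survivors (assumed top-(1 - p_t) fraction).
    p_t = quantile_probability(t, total_steps)
    survivors.sort(key=lambda item: item[1], reverse=True)
    keep = max(1, math.ceil(len(survivors) * (1.0 - p_t)))
    return sorted(i for i, _ in survivors[:keep])

scores = [0.9, 0.2, 0.8, 0.05, 0.6, 0.7]
early = cascade_filter(scores, class_threshold=0.1, t=0, total_steps=100)
middle = cascade_filter(scores, class_threshold=0.1, t=50, total_steps=100)
late = cascade_filter(scores, class_threshold=0.1, t=100, total_steps=100)
```

In this toy run the number of retained pseudo-labels grows from one at the start of training to every above-threshold prediction at the end, mirroring the loosening quantile described above.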
Coupled Prediction Heads. The standard approach to filtering pseudo-mask predictions employs a pixel-wise confidence threshold $\gamma_t^{mask}$ that is applied to each pixel $(u,v)$ of instance mask $\hat{m}_i$:

$$\hat{m}^{u,v}_i = \begin{cases} 1 & \text{if } h^{\text{mask}}(z^{w}_i) > \gamma^{\text{mask}}_t, \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where $h^{mask}$ is the mask head output for instance embedding $z_i^{w}$ obtained from the weakly augmented frame $x^{w}$.

Unlike masks, the prediction quality of bounding boxes is less correlated with high label scores. As such, recent SSL methods for object detection employ multiple passes to refine box predictions [49, 1]. Interestingly, we observe that the model learns to predict high quality masks well before it effectively predicts bounding boxes. Thus, we propose a coupling of the mask and box prediction heads so that during training, pseudo-boxes $\hat{b}_i$ are obtained by bounding their corresponding instance segmentation masks $\hat{m}_i$:

$$\hat{b}_i = \begin{bmatrix} \min_{u} \hat{m}_i & \min_{v} \hat{m}_i & \max_{u} \hat{m}_i & \max_{v} \hat{m}_i \end{bmatrix}. \tag{8}$$

We refer to this simple yet effective technique for pseudo-box assignment as Mask-to-Box (M2B) and demonstrate its advantage over the standard approach to predicting pseudo-boxes in Tabs. 3 and 5.
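As an illustration of Eqs. 7 and 8, the following NumPy sketch binarizes a mask-head output and extracts the tight bounding box of the result. The function names are hypothetical, and we assume that $u$ indexes image columns and $v$ indexes rows, which the text does not state explicitly.

```python
import numpy as np


def binarize_mask(mask_scores, gamma_mask: float = 0.5) -> np.ndarray:
    """Eq. 7: a pixel is 1 where the mask-head score exceeds the threshold."""
    return (np.asarray(mask_scores) > gamma_mask).astype(np.uint8)


def mask_to_box(mask):
    """Eq. 8: tight box [u_min, v_min, u_max, v_max] around the nonzero pixels.

    Assumption: u is the column index and v is the row index.
    Returns None for an empty mask, which has no well-defined box.
    """
    vs, us = np.nonzero(mask)  # np.nonzero returns (row indices, column indices)
    if us.size == 0:
        return None
    return np.array([us.min(), vs.min(), us.max(), vs.max()])
```

Returning `None` for an empty mask is a design choice of this sketch; a training loop would typically drop such instances from the pseudo-labels.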
Multi-Label Matching. A common practice in object segmentation is to apply non-maximum suppression (NMS) to eliminate redundant predictions. In our case, instance overlaps are common since objects are inserted at random as part of pseudo-sequence generation. Therefore, to make better use of the model’s predictions during training, we introduce a new adaptation of Label-Matching (LM) [1], whereby we retain several overlapping predictions that coincide with the dominant class label (instead of discarding all but one). We call this method Multi-Label Matching (MLM) and conduct an ablation study to assess its contribution to self-supervision (Tabs. 3 and 5), demonstrating its advantage.

3.4 Unified Framework

The complete architecture of RISE is presented in Fig. 2, comprising an LTI branch and an SSL branch that converge into a unified loss:

$$\mathcal{L}_{\text{total}} = \mathds{1}[\mathbf{y} \neq \varnothing]\,\mathcal{L}_{\mathrm{s}} + \lambda_3 \mathcal{L}_{\mathrm{embed}} + \mathds{1}[\mathbf{y} = \varnothing]\,\lambda_4 \mathcal{L}_{\mathrm{u}}, \tag{9}$$

where $\mathds{1}$ indicates that the supervised loss $\mathcal{L}_{\mathrm{s}}$ and unsupervised loss $\mathcal{L}_{\mathrm{u}}$ are used according to the availability of ground-truth labels $\mathbf{y}$, and $\mathcal{L}_{\mathrm{embed}}$ denotes the weighted combinations of the association loss. Hyperparameter search for $\lambda_3$ and details on $\lambda_1, \lambda_2$ and $\lambda_4$ are provided in Appendix A.A.
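The gating in Eq. 9 can be sketched in a few lines of Python. The function and argument names are hypothetical, the individual losses are assumed to be precomputed scalars, and no values for $\lambda_3$ or $\lambda_4$ are implied (they are given in the appendix, not here).

```python
def total_loss(loss_sup: float, loss_embed: float, loss_unsup: float,
               has_labels: bool, lambda3: float, lambda4: float) -> float:
    """Eq. 9: supervised loss for labeled samples, weighted unsupervised loss
    for unlabeled samples, and the weighted embedding loss in both cases."""
    supervised_term = loss_sup if has_labels else 0.0
    unsupervised_term = 0.0 if has_labels else lambda4 * loss_unsup
    return supervised_term + lambda3 * loss_embed + unsupervised_term
```

The two indicator terms are mutually exclusive, so each sample contributes either the supervised or the unsupervised loss, never both.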

4 Experiments

We conduct a series of experiments to evaluate the performance of the proposed approach in the Robotic Item Grasping domain. This domain is of high relevance to automated distribution warehouses, where robotic arms pick and place items inside totes. The experiments target a range of labeled data ratios, meaning that we intentionally restrict the model’s access to only a certain portion (%) of the labeled samples, and treat the remaining samples as unlabeled.

4.1 Setup

Datasets. Our main focus is the ARMBench [34] mix-object-tote benchmark comprising 44,234 images, split into 30,992 training images and 6,637 and 6,605 images for validation and testing, respectively. The images are not organized into sequences nor do they describe a localized action. Every object in the scene belongs to a single “object” category and is associated with a manually annotated instance mask.

We also evaluate RISE on the OCID [37] dataset (containing 2,390 images and 31 classes) for various rates of labeled-to-unlabeled data, and compare it to the current state-of-the-art [32].

We use the same RISE configuration (e.g., Beta function, thresholds, etc.) for both datasets. The results in Tab. 2 illustrate that our method is readily applied to new datasets without requiring domain-specific configuration adjustments.
Evaluation. We evaluate our method using the standard Average Precision (AP). We measure the overall AP across 10 IoU thresholds $[0.50, \dots, 0.95]$, as well as the IoU-thresholded precision $\text{AP}_{50}$ and $\text{AP}_{75}$. For OCID we use only the $\text{AP}_{50}$ to be consistent with prior art.
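To make the role of the IoU thresholds concrete, the sketch below computes the mask IoU of one prediction against one ground-truth mask and checks it against the 10 thresholds. The names are hypothetical, and the full AP computation (ranking predictions by confidence and integrating the precision-recall curve at each threshold) is intentionally omitted; a standard evaluation toolkit would normally be used for that.

```python
import numpy as np

# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged.
IOU_THRESHOLDS = np.arange(50, 100, 5) / 100.0


def mask_iou(pred, gt) -> float:
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: treated as no overlap in this sketch
    return float(np.logical_and(pred, gt).sum() / union)


def matches_per_threshold(pred, gt) -> np.ndarray:
    """Boolean vector: does the prediction match the ground truth at each threshold?"""
    return mask_iou(pred, gt) >= IOU_THRESHOLDS
```

In this picture, $\text{AP}_{50}$ and $\text{AP}_{75}$ correspond to the first and sixth entries of `IOU_THRESHOLDS`.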

In terms of partitioning, we use 100%, 10%, 2%, 1% and 0.5% of the data as fully annotated, and the remaining as unlabeled for ARMBench, and 100%, 10% and 5% for OCID. We compare RISE with the officially reported performance from ARMBench [34] and RoboLLM [29], in which a model was trained on the entire training set. This baseline is the existing state-of-the-art on the ARMBench dataset. In addition, we compare RISE with Deformable DETR [56] (denoted DeDETR).

4.2 Results

Tab. 1 shows the results of RISE on various data partitions of labeled/unlabeled ratios of the ARMBench, compared with Deformable DETR and SAM [19] (fine-tuned), as well as the results reported by the authors of ARMBench [34]. Both DeDETR and RISE use Swin-L [28] (197M parameters) as backbone, while SAM uses ViT-H [23] (636M parameters) and RoboLLM [29] uses Beit-3 base (87M parameters). Across all partitions, RISE outperforms the other methods. Fig. 4 illustrates high-quality masks predicted by RISE trained on 1% of the labeled data, with the remaining 99% treated as unlabeled. Most of the line-of-sight objects are accurately segmented and a few heavily occluded objects are missed. Importantly, RISE trained on 1% annotated data performs better than DeDETR and SAM trained on 10% annotated samples ($10\times$ the amount of annotations for training/fine-tuning).

Table 1: ARMBench [34] mix-object-tote instance segmentation with a subset of annotations. The first column denotes the percentage (and, in parentheses, the number) of annotated samples, with the rest treated as unlabeled data for self-supervision. The second column gives the method name and the remaining columns report AP measures, with best performers marked in bold.
% ResNet-50 ResNet-101 ViT
Labeled Method AP $\text{AP}_{50}$ $\text{AP}_{75}$ AP $\text{AP}_{50}$ $\text{AP}_{75}$ AP $\text{AP}_{50}$ $\text{AP}_{75}$
0.5% DeDETR [53] 27.03 29.32 26.65 28.36 30.69 28.14 36.47 36.46 31.75
(155) M2F [5] 55.3 59.9 54.7
SAM [20] 61.38 74.04 63.51
RISE 66.15 78.80 69.67 71.40 82.10 72.30 72.14 83.25 73.73
1% DeDETR [53] 27.17 29.64 26.67 30.38 34.52 29.73 39.46 39.44 33.51
(309) YOLACT 36.1 59.2 44.8
M2F [5] 58.6 64.7 58.6
SAM [20] 67.42 82.26 70.93
RISE 69.10 82.10 73.80 73.00 83.25 73.94 73.72 84.89 74.89
2% DeDETR [53] 31.79 36.5 31.81 33.14 39.70 34.19 42.15 66.20 43.39
M2F [5] 61.4 68.5 61.2
(618) RISE 72.80 82.90 74.40 73.66 83.44 75.89 73.92 84.00 76.25
10% DeDETR [53] 48.00 57.44 48.17 52.23 59.98 49.8 59.19 75.5 60.42
(3,099) YOLACT 47.40 68.20 52.70
M2F [5] 68.2 76.5 68.2
SAM [20] 71.47 82.78 73.96
RISE 73.39 83.48 75.09 74.27 84.33 75.54 74.95 85.16 76.26
100% ARMBench [34] - 72.00 61.00 - - - - - -
(30,992) DeDETR [53] 52.11 60.38 52.52 53.80 62.00 52.80 62.75 77.03 63.40
M2F [5] 73.00 81.2 74.00
RoboLLM [29] 82.0 67.00
RISE 73.41 83.53 75.15 74.47 84.74 75.93 76.04 86.37 77.51

Tab. 2 compares performance with partitions of labeled/unlabeled data ratios from OCID, showing the advantage of RISE.

Table 2: Evaluation on OCID [37] (RGB only). + denotes second-stage networks. Bottom rows show the performance when only a portion (%) of annotations is used. The proposed approach achieves significantly better results even with few annotations compared to prior art.
Method $\text{AP}_{50}$
UCN [46] 54.8
UCN++ [46] 59.1
Mask2Former [5] 67.2
MSMFormer[31] 72.9
MSMFormer+ [31] 73.9
MRCNN [13] 77.6
RISE 78.2
RISE (5%) 75.1
RISE (10%) 77.3
Figure 3: Fine-tuned SAM (1% ARMBench data) showing many fragmented masks and false positive artifacts.
Figure 4: Qualitative results of RISE using ResNet-101 backbone, trained on 1% of the labeled data (99% treated as unlabeled). Comparing the ground truth (center column) and the predicted masks (right-most column), we see that the majority of large items are accurately segmented, while some of the smaller or heavily occluded objects are occasionally missed. The segmentation masks are continuous, indicating high confidence for every object. Mask boundaries are the only regions with instance–background ambiguity (evident by mild noise at object boundaries).

Foundation Model Comparison
As an additional baseline we compare RISE with the “Segment Anything” (SAM) foundation model [19], fine-tuned on a subset of the ARMBench dataset. In Tab. 1 we demonstrate that despite SAM’s unrivalled ability to segment anything, it is prone to over-segment and produce mask artifacts, even after fine-tuning on a small portion of domain-specific images. Fig. 3 shows an example where SAM, fine-tuned on 1% of the data, still struggles to discern objects accurately, resulting in fragmented and incomplete object masks and mask predictions that target less significant elements of the image (such as packaging features, rivets and shadows). The numerous false positive predictions impact the overall performance.

4.3 Ablation Study

We provide an ablation study of the various design choices made in implementing RISE: impact of losses, pseudo-label threshold strategies and parameters. Tab. 3 details the contribution of the different elements within RISE on the fully-supervised training set. The most substantial improvement is attributed to the Pseudo-Sequence (PS) strategy outlined in Sec. 3.2. The coupling of prediction heads in Mask-to-Box (M2B in Eq. 8) refines the supervision signal for box predictions, further improving the performance. Combining it with Multi-Label Matching (MLM) and Optimal Transport (OT) yields the best performing version of RISE.

Table 3: Ablation study on different components of RISE: Pseudo-Sequences (PS), Optimal Transport (OT), Mask-to-Box coupling (M2B) and Multi-Label Matching (MLM). We evaluate the contribution of these components using 100% of the annotated data, showing that the combined approach achieves the best results.
PS OT M2B MLM AP
53.80
73.85
62.92
74.16
74.35
74.47
Table 4: Ablation study of pseudo-label threshold strategy, comparing the score threshold $\gamma_t^{cls}$, the quantile $Q(p_t)$, and the proposed score filtering cascade (Eq. 6) that applies a threshold followed by a quantile $\gamma_t^{cls} \rightarrow Q(p_t)$ (or reversed $Q(p_t) \rightarrow \gamma_t^{cls}$). The best performance is achieved for the proposed cascade approach.
Threshold Strategy AP $\text{AP}_{50}$ $\text{AP}_{75}$
Threshold only $\gamma_t^{cls}$ 72.91 83.39 74.38
Quantile only $Q(p_t)$ 72.59 83.2 74.44
Cascade $\gamma_t^{cls} \rightarrow Q(p_t)$ 74.47 84.33 75.93
Cascade $Q(p_t) \rightarrow \gamma_t^{cls}$ 73.0 82.75 74.55

Next we evaluate the pseudo-label elimination strategy of either a standard score threshold or quantile function, compared with the proposed cascade approach (Eq. 6). Notably, setting the threshold or quantile too low would include more false positive predictions in training. Setting them too high would eliminate correct predictions since very few predictions would meet the required prediction score. This holds even when both threshold and quantile are dynamic (changing via a predefined schedule). Tab. 4 shows that the cascade approach, which first enforces a lenient threshold and then a quantile, yields the best results.

In Fig. 5 we illustrate that setting the threshold too high at any point during training would eliminate many true-positive predictions, solely due to the model predicting a low class label score. This is more prominent early in training, since the model gains confidence in its predictions as training progresses. We visualize the “ideal” threshold that would only retain true-positives and ensure that no false-positive pixels are passed through.

(a) Label accuracy over time. The solid black line represents the “ideal” class threshold $\gamma^{cls}$ that would eliminate all false-positive predictions and retain only true-positives. The regions highlighted in gray denote the standard deviation of the ideal threshold.
(b) Mask accuracy over time. The solid black line describes the mask threshold $\gamma^{mask}$ that, for each instance, distinguishes between pseudo-masks and background. The gray-highlighted regions denote the standard deviation.
Figure 5: Label and mask accuracy of pseudo-labels. For both, the $x$-axis measures training steps in multiples of 1000.
Table 5: Ablation study of mask threshold $\gamma^{mask}$, class threshold $\gamma^{cls}$ and pseudo-box strategies, showing AP for 1% annotated data. The Pseudo-box column corresponds to the standard pseudo-box approach, which discards box predictions corresponding to pseudo-labels below the threshold $\gamma^{cls}$. The Mask-to-Box (M2B) and the combined Mask-to-Box with Multi-Label Matching (M2B+MLM), introduced in this work, extract pseudo-boxes from pseudo-masks and instead filter individual pixels whose scores fall below $\gamma^{mask}$. The table shows that M2B+MLM produced the best results.
$\gamma^{cls}$ or $\gamma^{mask}$ Pseudo-box M2B M2B + MLM
0.5 71.98 72.71 73.00
0.6 71.70 72.55 72.87
0.7 71.57 72.28 72.86

Finally, we measure the impact of the different pseudo-box strategies. Tab. 5 shows the resulting AP for RISE trained on 1% of the labeled data and various values of threshold $\gamma^{mask}$. The standard approach filters the boxes by thresholding the class prediction score using $\gamma^{cls}$. We denote by M2B the effect of Mask-to-Box (extracting bounding boxes from the predicted instance masks), and denote by MLM the use of multiple boxes towards the self-supervised loss $\mathcal{L}_{\mathrm{u}}$ in Eq. 5. The results demonstrate that employing a mask threshold of $\gamma^{mask} = 0.5$ in conjunction with M2B + MLM achieves the best performance.

5 Conclusion

In this work, we present RISE, a novel framework that incorporates semi-supervised learning with learning through scene interaction in the context of a few-annotation data regime. RISE is modular and can complement other segmentation models that emit intermediate instance embeddings. We demonstrate that RISE improves $\text{AP}_{50}$ by over $+10$ compared to previous state-of-the-art, after training end-to-end on just 0.5% of the labeled data (with 99.5% of the data treated as unlabeled). With just 1% of the labeled data, RISE achieves better performance than the baselines (DeDETR, SAM, RoboLLM) trained on $10\times$ the amount of labeled data. On OCID (RGB), RISE sets a new state-of-the-art, and is near state-of-the-art when restricted to a fraction of the annotations. For future work, we intend to leverage “before” and “after” observations directly using robotic item grasping in real-world environments (rather than synthetically inserting instances into images), with the overarching goal of lifelong learning for robot perception.

A limitation of the proposed approach is that it underperforms when presented with objects that make few or no appearances in the truncated (labeled) training data. Access to a handful of annotated examples means that not all objects are encountered during training, resulting in some cases where two objects are segmented as one. In the context of robotic grasping, this may lead to a failed object-grasping attempt. However, since grasp failures also alter the scene, we believe that capturing snapshots of the scene before and after the attempted interaction would help improve the grasping precision in the long term.

References

  • [1] Chen, B., Chen, W., Yang, S., Xuan, Y., Song, J., Xie, D., Pu, S., Song, M., Zhuang, Y.: Label matching semi-supervised object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
  • [2] Chen, H., Venkatesh, R., Friedman, Y., Wu, J., Tenenbaum, J.B., Yamins, D.L., Bear, D.M.: Unsupervised segmentation in real-world images via spelke object inference. In: European Conference on Computer Vision. pp. 719–735. Springer (2022)
  • [3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (Jul 2020), https://proceedings.mlr.press/v119/chen20j.html
  • [4] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 22243–22255. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html
  • [5] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation (2022)
  • [6] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.: RandAugment: practical automated data augmentation with a reduced search space. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 18613–18624. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/hash/d85b63ef0ccb114d0a3bb7b7d808028f-Abstract.html
  • [7] Dwibedi, D., Misra, I., Hebert, M.: Cut, paste and learn: Surprisingly easy synthesis for instance detection. In: Proceedings of the IEEE international conference on computer vision. pp. 1301–1310 (2017)
  • [8] Garg, S., Sunderhauf, N., Dayoub, F., Morrison, D., Cosgun, A., Carneiro, G., Wu, Q., Chin, T.J., Reid, I.D., Gould, S., Corke, P., Milford, M.: Semantics for robotic mapping, perception and interaction: A survey. ArXiv abs/2101.00443 (2021)
  • [9] Ge, Z., Liu, S., Li, Z., Yoshie, O., Sun, J.: Ota: Optimal transport assignment for object detection. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 303–312 (2021)
  • [10] Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)
  • [11] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B.: Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv preprint arXiv:2012.07177 (2020)
  • [12] Grad, E., Kimhi, M., Halika, L., Baskin, C.: Benchmarking label noise in instance segmentation: Spatial noise matters (2024)
  • [13] He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV) pp. 2980–2988 (2017)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2016), https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html
  • [15] Horváth, D., Bocsi, K., Erdös, G., Istenes, Z.: Sim2real grasp pose estimation for adaptive robotic applications. ArXiv abs/2211.01048 (2022)
  • [16] Ke, L., Danelljan, M., Ding, H., Tai, Y.W., Tang, C.K., Yu, F.: Mask-free video instance segmentation. In: CVPR (2023)
  • [17] Kim, B., Choo, J., Kwon, Y.D., Joe, S., Min, S., Gwon, Y.: SelfMatch: combining contrastive self-supervision and consistency for semi-supervised learning. arXiv preprint (Jan 2021), https://arxiv.longhoe.net/abs/2101.06480
  • [18] Kimhi, M., Kimhi, S., Zheltonozhskii, E., Litany, O., Baskin, C.: Semi-supervised semantic segmentation via marginal contextual information (2023)
  • [19] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
  • [20] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [21] Kroemer, O., Niekum, S., Konidaris, G.D.: A review of robot learning for manipulation: Challenges, representations, and algorithms. J. Mach. Learn. Res. 22, 30:1–30:82 (2019)
  • [22] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). vol. 2, pp. 2169–2178 (2006). https://doi.org/10.1109/CVPR.2006.68
  • [23] Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones for object detection. In: European Conference on Computer Vision. pp. 280–296. Springer (2022)
  • [24] Li, Y., Zhang, M., Grotz, M., Mo, K., Fox, D.: Stow: Discrete-frame segmentation and tracking of unseen objects for warehouse picking robots. In: Conference on Robot Learning, CoRL 2023 (2023), https://openreview.net/pdf?id=48qUHKUEdBf
  • [25] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 936–944 (2017). https://doi.org/10.1109/CVPR.2017.106
  • [26] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision (Oct 2017), https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html
  • [27] Liu, Y., Chen, X., Abbeel, P.: Self-supervised instance segmentation by grasping. ArXiv abs/2305.06305 (2023)
  • [28] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
  • [29] Long, Z., Killick, G., McCreadie, R., Camarasa, G.A.: Robollm: Robotic vision tasks grounded on multimodal large language models (2023)
  • [30] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  • [31] Lu, Y., Chen, Y., Ruozzi, N., Xiang, Y.: Mean shift mask transformer for unseen object instance segmentation. ArXiv abs/2211.11679 (2022)
  • [32] Lu, Y., Chen, Y., Ruozzi, N., Xiang, Y.: Mean shift mask transformer for unseen object instance segmentation (2022). https://doi.org/10.48550/ARXIV.2211.11679
  • [33] Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV). pp. 565–571. Ieee (2016)
  • [34] Mitash, C., Wang, F., Lu, S., Terhuja, V., Garaas, T.W., Polido, F., Nambi, M.: Armbench: An object-centric benchmark dataset for robotic manipulation. 2023 IEEE International Conference on Robotics and Automation (ICRA) pp. 9132–9139 (2023)
  • [35] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 658–666 (2019)
  • [36] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 596–608. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/hash/06964dce9addb1c5cb5d6e3d9838f733-Abstract.html
  • [37] Suchi, M., Patten, T., Vincze, M.: Easylabel: A semi-automatic pixel-wise object annotation tool for creating robotic rgb-d datasets. 2019 International Conference on Robotics and Automation (ICRA) pp. 6678–6684 (2019), https://api.semanticscholar.org/CorpusID:59604455
  • [38] Tang, C., Huang, D., Ge, W., Liu, W., Zhang, H.: Graspgpt: Leveraging semantic knowledge from a large language model for task-oriented grasping. arXiv preprint arXiv:2307.13204 (2023)
  • [39] Wang, Y., Chen, H., Heng, Q., Hou, W., Fan, Y., Wu, Z., Wang, J., Savvides, M., Shinozaki, T., Raj, B., Schiele, B., Xie, X.: FreeMatch: self-adaptive thresholding for semi-supervised learning. In: International Conference on Learning Representations (2023), https://openreview.net/forum?id=PDrUPTXJI_A
  • [40] Wang, Y., Wang, H., Shen, Y., Fei, J., Li, W., Jin, G., Wu, L., Zhao, R., Le, X.: Semi-supervised semantic segmentation using unreliable pseudo labels. In: IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR) (2022), https://openaccess.thecvf.com/content/CVPR2022/html/Wang_Semi-Supervised_Semantic_Segmentation_Using_Unreliable_Pseudo-Labels_CVPR_2022_paper.html
  • [41] Wen, B., Lian, W., Bekris, K., Schaal, S.: Catgrasp: Learning category-level task-relevant grasping in clutter from simulation. ICRA 2022 (2022)
  • [42] Wen, H., Yan, J., Peng, W., Sun, Y.: Transgrasp: Grasp pose estimation of a category of objects by transferring grasps from only one labeled instance. ArXiv abs/2207.07861 (2022)
  • [43] Wu, J., Jiang, Y., Bai, S., Zhang, W., Bai, X.: Seqformer: Sequential transformer for video instance segmentation. In: ECCV (2022)
  • [44] Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., Bai, X.: In defense of online models for video instance segmentation. In: ECCV (2022)
  • [45] Wu, Y., Yu, T., Hua, G.: Tracking appearances with occlusions. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. vol. 1, pp. I–I (2003). https://doi.org/10.1109/CVPR.2003.1211433
  • [46] Xiang, Y., Xie, C., Mousavian, A., Fox, D.: Learning rgb-d feature embeddings for unseen object instance segmentation. In: Conference on Robot Learning (2020)
  • [47] Xie, C., Xiang, Y., Mousavian, A., Fox, D.: Unseen object instance segmentation for robotic environments. IEEE Transactions on Robotics (T-RO) (2021)
  • [48] Xie, Q., Dai, Z., Hovy, E., Luong, T., Le, Q.V.: Unsupervised data augmentation for consistency training. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 6256–6268. Curran Associates, Inc. (2020), https://proceedings.neurips.cc/paper/2020/hash/44feb0096faa8326192570788b38c1d1-Abstract.html
  • [49] Xu, M., Zhang, Z., Hu, H., Wang, J., Wang, L., Wei, F., Bai, X., Liu, Z.: End-to-end semi-supervised object detection with soft teacher. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3060–3069 (Oct 2021), https://openaccess.thecvf.com/content/ICCV2021/html/Xu_End-to-End_Semi-Supervised_Object_Detection_With_Soft_Teacher_ICCV_2021_paper.html
  • [50] Yang, L., Qi, L., Feng, L., Zhang, W., Shi, Y.: Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In: CVPR (2023)
  • [51] Ying, K., Zhong, Q., Mao, W., Wang, Z., Chen, H., Wu, L.Y., Liu, Y., Fan, C., Zhuge, Y., Shen, C.: CTVIS: Consistent Training for Online Video Instance Segmentation (2023)
  • [52] Zhang, B., Wang, Y., Hou, W., Wu, H., Wang, J., Okumura, M., Shinozaki, T.: Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems. vol. 34, pp. 18408–18419. Curran Associates, Inc. (2021), https://proceedings.neurips.cc/paper/2021/hash/995693c15f439e3d189b06e89d145dd5-Abstract.html
  • [53] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
  • [54] Ziegler, A., Asano, Y.M.: Self-supervised learning of object parts for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14502–14511 (2022)
  • [55] Zini, S., Buzzelli, M., Twardowski, B., van de Weijer, J.: Planckian jitter: enhancing the color quality of self-supervised visual representations. arXiv preprint arXiv:2202.07993 (2022)
  • [56] Zong, Z., Song, G., Liu, Y.: DETRs with collaborative hybrid assignments training. arXiv preprint (Nov 2022), https://arxiv.longhoe.net/abs/2211.12860
  • [57] Zoph, B., Ghiasi, G., Lin, T.Y., Cui, Y., Liu, H., Cubuk, E.D., Le, Q.V.: Rethinking pre-training and self-training. ArXiv abs/2006.06882 (2020)

Supplementary Materials for Robot Instance Segmentation with Few Annotations for Grasping

Appendix A.A Technical details

Model The RISE framework begins with an image augmentation step that feeds into a feature extractor followed by an instance segmentation model, and ends at prediction heads for class, bounding box, mask and instance association. We use ResNet-50, ResNet-101 [14] and the Swin-L transformer [28] as backbones throughout our experiments, followed by Deformable DETR [53] with $6$ encoders and decoders, a width of $256$ and $300$ fixed instance queries, converging on an FPN-like dynamic mask head (as in SeqFormer [43]). In our evaluation, we measure the contribution of the proposed approach to Deformable DETR, which serves as the baseline. All feature extractors are pretrained on COCO instance segmentation, as is common in instance segmentation pretraining [57]. The proposed method incorporates a contrastive head (inspired by IDOL [44]) and introduces an instance bank, a self-supervision branch for non-labeled data, coupled prediction heads for stability (M2B) and a label matching strategy during training (MLM). In aggregate, these allow RISE to outperform both Deformable DETR and SAM, even when these are trained on $10\times$ more data ($1\%$ vs. $10\%$).

Hyperparameters Recall from Sec. 3.1 and Sec. 3.3 that the supervised loss $\mathcal{L}_{\mathrm{s}}$ and unsupervised loss $\mathcal{L}_{\mathrm{u}}$ (Eq. 1 and Eq. 5, respectively) are a combination of the class loss $\mathcal{L}_{cls}$, the bounding-box loss $\mathcal{L}_{box}$ weighted by $\lambda_1$, and the mask loss $\mathcal{L}_{mask}$ weighted by $\lambda_2$. We set the loss weights to $\lambda_1 = 2.0$ and $\lambda_2 = 1.0$. The total loss $\mathcal{L}_{\mathrm{total}}$ (Eq. 9) combines the supervised loss $\mathcal{L}_{\mathrm{s}}$ or the unsupervised loss $\mathcal{L}_{\mathrm{u}}$ (depending on the availability of a label $\mathbf{y}$) with the association loss $\mathcal{L}_{\mathrm{embed}}$ weighted by $\lambda_3$. Tab. 6 details an ablation over values of $\lambda_3$, showing that $\mathcal{L}_{\mathrm{embed}}$ contributes to performance, with the best results attained for $\lambda_3 = 0.05$.
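The weighting described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: Eq. 1, Eq. 5 and Eq. 9 are not reproduced in this appendix, so the sketch assumes that the class loss carries unit weight and that the terms are simply summed. The function names `detection_loss` and `total_loss` are hypothetical.

```python
# Loss weights quoted in the text.
LAMBDA_BOX = 2.0     # lambda_1, weight of the bounding-box loss
LAMBDA_MASK = 1.0    # lambda_2, weight of the mask loss
LAMBDA_EMBED = 0.05  # lambda_3, best value in the Tab. 6 ablation


def detection_loss(l_cls: float, l_box: float, l_mask: float) -> float:
    """Class + weighted box + weighted mask loss.

    Assumption: the class loss has unit weight and the same form is
    shared by the supervised (L_s) and unsupervised (L_u) losses.
    """
    return l_cls + LAMBDA_BOX * l_box + LAMBDA_MASK * l_mask


def total_loss(l_cls: float, l_box: float, l_mask: float, l_embed: float,
               lambda_embed: float = LAMBDA_EMBED) -> float:
    """L_s or L_u (chosen upstream by label availability) plus the
    weighted association loss L_embed."""
    return detection_loss(l_cls, l_box, l_mask) + lambda_embed * l_embed
```

Setting `lambda_embed=0` reproduces the variant in the ablation that ignores the association loss.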

Table 6: Ablation of the weight $\lambda_3$ applied to the sequence association loss $\mathcal{L}_{\mathrm{embed}}$ described in Eq. 4. This evaluation uses $10\%$ of ARMBench labels ($90\%$ treated as unlabeled data) and Swin-L as the backbone. A value of $\lambda_3 = 0$ corresponds to a variant that ignores $\mathcal{L}_{\mathrm{embed}}$. The best performance is obtained for $\lambda_3 \in [0.05, 0.1]$.

| $\lambda_3$ | AP | $\text{AP}_{50}$ | $\text{AP}_{75}$ |
|---|---|---|---|
| 0 | 74.7 | 84.9 | 75.9 |
| 0.02 | 74.4 | 84.5 | 75.6 |
| 0.05 | 74.9 | 85.2 | 76.0 |
| 0.1 | 74.5 | 83.8 | 76.2 |
| 0.5 | 74.3 | 84.2 | 75.3 |

Augmentation Strategy The input images are downsampled and randomly cropped so that the longest side is at most $600$ pixels and the shortest side is at least $480$ pixels. Recall from Sec. 3.2 that the instance bank contains object cutouts from the labeled portion of the dataset, inspired by [7]. We randomly insert $K$ instances from the instance bank into the image to produce the “before” image $x_1$ and apply weak augmentations (e.g. slight rotation, translation, brightness, etc.). However, we depart from previous approaches by distributing the positions of these $K$ instances according to $Beta(\alpha=0.5, \beta=0.5)$, depicted in Fig. 6, making it less likely for synthetically placed instances to occlude actual objects in the image. In addition, we ensure that the $K$ inserted instances do not overlap one another by more than $85\%$, since they form ground-truth labels during self-supervision. We then generate the “after” image $x_2$ by randomly adding more instances from the instance bank, or alternatively removing (or transforming) already inserted instances, followed by another round of weak augmentations.
The before and after frames serve learning through interaction, and we facilitate self-supervised learning by strongly augmenting $x_1$ to yield $x_3$, treating $x_w = x_1$ and $x_s = x_3$ as an input pair of weakly and strongly augmented images. We employ this approach in our evaluation on both ARMBench [34] and OCID [37] without needing to tune its parameters for specific datasets.
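The placement step can be illustrated with a short sketch. It is written under stated assumptions and is not the released implementation: instance centers are drawn from two independent $Beta(0.5, 0.5)$ variables scaled to the image size, overlap is measured as intersection area over the smaller box's area (the text does not define the overlap measure), and a candidate is resampled a bounded number of times before being skipped. The names `sample_center`, `overlap_fraction` and `place_instances` are hypothetical.

```python
import random


def sample_center(width, height, rng, alpha=0.5, beta=0.5):
    """Draw an (x, y) center from two independent Beta(alpha, beta) variables.

    With alpha = beta = 0.5 the density is U-shaped, so centers tend to
    land near the image borders rather than the middle.
    """
    return rng.betavariate(alpha, beta) * width, rng.betavariate(alpha, beta) * height


def overlap_fraction(a, b):
    """Intersection area over the smaller box's area; boxes are (x0, y0, x1, y1).

    Assumption: this is one plausible way to measure overlap.
    """
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (iw * ih) / min(area_a, area_b)


def place_instances(sizes, width, height, max_overlap=0.85, max_tries=50, seed=0):
    """Place boxes of the given (w, h) sizes, resampling on excessive overlap.

    Boxes may extend past the image border; cropping is left to the caller.
    An instance that cannot be placed within max_tries is skipped.
    """
    rng = random.Random(seed)
    placed = []
    for w, h in sizes:
        for _ in range(max_tries):
            cx, cy = sample_center(width, height, rng)
            box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
            if all(overlap_fraction(box, other) <= max_overlap for other in placed):
                placed.append(box)
                break
    return placed
```

For example, `place_instances([(50, 40)] * 5, 600, 480)` returns up to five boxes whose pairwise overlap stays at or below the $85\%$ limit.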

Figure 6: Two-dimensional independent $Beta(\alpha=0.5, \beta=0.5)$ distribution representing the spread of instance-bank objects inserted into unlabeled images. The distribution favors placing inserted objects at the periphery of the image, since most images concentrate their information near the center (bright regions denote low probability).

Training. We train our model for $12{,}000$ iterations using the AdamW [30] optimizer with a learning rate of $10^{-4}$ and a weight decay of $10^{-4}$, together with a StepLR learning-rate scheduler that steps down by an order of magnitude after $8{,}000$ iterations.
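The resulting learning-rate schedule can be written in closed form. The sketch below is framework-free and mirrors the usual step-decay rule (base rate multiplied by a decay factor once per completed interval). It assumes the scheduler is stepped once per training iteration, which the text does not state explicitly; the function name `learning_rate` is hypothetical.

```python
def learning_rate(step: int, base_lr: float = 1e-4,
                  drop_step: int = 8000, gamma: float = 0.1) -> float:
    """Step decay: multiply the base rate by gamma once per completed interval.

    Over the 12,000 training iterations this yields a single drop, from
    1e-4 to 1e-5, at iteration 8,000.
    """
    return base_lr * gamma ** (step // drop_step)
```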

Appendix A.B Prediction Matching

The model predicts up to $300$ instance labels, boxes and masks, which is often far beyond the actual instance count in a given image. In order to compute the loss between valid predictions and ground-truth annotations, we compute the bipartite cost matrix, which measures the IoU of each prediction against each ground-truth annotation (either based on box IoU or using the Mask-to-Box method detailed in Sec. 3.3). We then find the fitting assignment for each ground-truth annotation by solving an Optimal Transport (OT) problem [9]. A similar approach, described in Sec. 3.2, serves toward computing $\mathcal{L}_{\mathrm{embed}}$, which requires positive and negative views of an instance. We introduce a method inspired by IDOL [44], where the top-$10$ prediction matches of each ground-truth annotation are treated as positive views and the rest are considered negative views. The impact of matching is evident in the ablation study in Tab. 3, where we use either OT or a more standard approach of using the top $0.7$ IoU as positive and bottom $0.3$ IoU as negative.
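To make the positive/negative split concrete, the following sketch ranks predictions by box IoU against a single ground-truth box and treats the top-$k$ as positive views. It is a deliberate simplification: it uses plain IoU ranking rather than the Optimal Transport assignment described above, and the helper names `box_iou` and `split_views` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def split_views(pred_boxes, gt_box, top_k=10):
    """Return (positive_indices, negative_indices) for one ground-truth box.

    Predictions are ranked by IoU with the ground truth; the top_k are
    positives and all remaining predictions are negatives.
    """
    ious = [box_iou(p, gt_box) for p in pred_boxes]
    order = sorted(range(len(pred_boxes)), key=lambda i: ious[i], reverse=True)
    return order[:top_k], order[top_k:]
```

With $300$ queries and `top_k=10`, each ground-truth annotation would contribute $10$ positive and $290$ negative views under this simplified ranking.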

This flow is similarly applied during the self-supervision phase, with the distinction of using pseudo-labels, pseudo-boxes and pseudo-masks instead of manually annotated ground truth. Here we also employ Multi-Label Matching (MLM, Sec. 3.3), inspired by [1], to allow the model to learn from multiple pseudo-labels predicted from the weak augmentation $x_w$. The impact of MLM is demonstrated in Tab. 3, where it further contributes to the framework’s performance.

Appendix A.C Thresholds

We use time-dependent thresholds [52, 17], whereby an initial threshold value increases every $1000$ training steps. The class and mask thresholds start at $\gamma_t^{\text{cls}} = \gamma_t^{\text{mask}} = 0.85$ and peak at $0.98$. For the Cascade approach (Sec. 3.3), which combines a lenient threshold followed by a quantile $Q_t$ described in Eq. 6, we set the initial class and mask thresholds to $\gamma_t^{\text{cls}} = \gamma_t^{\text{mask}} = 0.5$, peaking at $0.85$. The quantile $Q_t$ follows the schedule $p_t = a_0 \cdot (1 - \nicefrac{t}{T})$, where $t$ is the training step, $T$ denotes the total number of training steps, and $a_0 = 0.995$ is the quantile base value. Upon ranking the model’s predicted instances by their class score, only the top $p_t$ are retained, and the rest are discarded.
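The two schedules can be sketched as follows. Several details are assumptions, since Eq. 6 is not reproduced in this appendix: the threshold is taken to rise in equal increments every $1000$ steps and to reach its peak at the final step, and the quantile stage is read as keeping the surviving predictions whose score is at or above the $p_t$-quantile of the surviving scores (a reading under which the quantile stage loosens as $p_t$ decays toward zero). The function names are hypothetical.

```python
def scheduled_threshold(step, start, peak, total_steps=12000, interval=1000):
    """Threshold that rises from `start` to `peak` in equal increments.

    Assumption: one increment per `interval` steps, peak reached at the end.
    """
    n_updates = total_steps // interval
    k = min(step // interval, n_updates)
    return start + (peak - start) * k / n_updates


def quantile_schedule(step, total_steps=12000, a0=0.995):
    """p_t = a0 * (1 - t / T), as given in the text."""
    return a0 * (1.0 - step / total_steps)


def cascade_filter(scores, step, total_steps=12000, start=0.5, peak=0.85, a0=0.995):
    """Lenient scheduled threshold followed by a quantile cut (assumed reading)."""
    gamma = scheduled_threshold(step, start, peak, total_steps)
    survivors = [s for s in scores if s >= gamma]
    if not survivors:
        return []
    ranked = sorted(survivors)
    p_t = quantile_schedule(step, total_steps, a0)
    # Nearest-rank quantile with floor; p_t <= a0 < 1 keeps the index in range.
    cutoff = ranked[int(p_t * (len(ranked) - 1))]
    return [s for s in survivors if s >= cutoff]
```

Under this reading, early in training the quantile cut keeps only the highest-ranked survivors, while late in training the scheduled threshold becomes the binding constraint.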

Appendix A.D Study of Cascade Threshold

We study the behavior of the pseudo-label and pseudo-mask Cascade filter strategy (Eq. 6). We evaluate the per-instance prediction score of the model using different base values for the quantile $Q_t$ of the Cascade threshold. In Fig. 7, each color band represents $1000$ iterations. The figure shows that setting the base value of the quantile too low would allow in more false-negatives as pseudo-labels and pseudo-masks. Alternatively, setting it too high would discard valuable predictions, as they do not meet the ranking requirement of the quantile. Following this evaluation, we set the quantile base value to $a_0 = 0.995$, which leads to the most balanced behavior of discarding false-positive predictions while allowing through true-positives (even when their score would be considered too low by a standard scheduled threshold).

Panels: (a) $a_0 = 0.7$; (b) $a_0 = 0.9$; (c) $a_0 = 0.95$; (d) $a_0 = 0.99$; (e) $a_0 = 0.995$.
Figure 7: Confidence density over time using different quantile values for the cascade threshold (Eq. 6). The $x$-axis represents the score of all samples, the $y$-axis the valid instance count (instance density), and the color bands correspond to training iterations in increments of $1000$. The cascade threshold applies both a time-dependent threshold (which tightens over time) and a time-dependent quantile $Q_t$ (which loosens over time). The base quantile value $a_0$ is given for each subfigure, showing that the best initial value for the quantile is $0.995$.
Figure 8: Failure cases. Mask predictions of a model trained on $1\%$ annotated data and $99\%$ unlabeled data (ResNet-50 backbone). The top row shows that the model accurately predicts the three objects in the tote. The bottom row includes an additional item inserted from the instance bank, which partially overlaps several objects. Although the model correctly segments the inserted object, it completely misses one occluded object.

Appendix A.E Failure Cases

In both the supervised and self-supervised stages, we randomly draw instance-bank objects and distribute them in the image according to a 2D $Beta(\alpha, \beta)$ distribution (Eq. 2), and prevent object overlap that exceeds $85\%$ by resampling from the distribution in case of such overlap. In the supervised phase, we also ensure that inserted objects do not overlap existing (ground-truth) objects by more than $85\%$, whereas in the self-supervised phase, the $Beta(\alpha, \beta)$ distribution (Eq. 2) helps reduce the likelihood of inserted instance-bank objects overlapping actual objects in the image (since no ground truth is available). Despite these precautions, failure cases still occur, particularly at very low annotation rates. Since our method incorporates noisy pseudo-labels in the low-annotation regime, we will follow improvements in handling noisy spatial labels [12] to combat noise and improve the pseudo-labels. Fig. 8 shows how a model trained on $1\%$ of the labeled data ($99\%$ treated as unlabeled) accurately predicts the masks of all objects in the “before” image $x_1$ (and ignores the background). However, in the “after” image $x_2$ (post-interaction), which contains an additionally inserted object (bottom row), the model fails to produce masks for the occluded cardboard box.