Alleviating the Burden of Labeling:
Sentence Generation by Attention Branch Encoder–Decoder Network

Tadashi Ogura1, Aly Magassouba1, Komei Sugiura1,2, Tsubasa Hirakawa3, Takayoshi Yamashita3,
Hironobu Fujiyoshi3, and Hisashi Kawai1
Manuscript received: February 22, 2020; Revised June 2, 2020; Accepted July 7, 2020. This paper was recommended for publication by Editor Dongheui Lee upon evaluation of the Associate Editor and Reviewers’ comments.
1 Authors are with the National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika, Soraku, Kyoto 619-0289, Japan. [email protected]
2 Author is with Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Kanagawa 223-8522, Japan. [email protected]
3 Authors are with Chubu University, 1200 Matsumotocho, Kasugai, Aichi 487-8501, Japan. {[email protected], takayoshi@isc, fujiyoshi@isc}.chubu.ac.jp
Digital Object Identifier (DOI): see top of this page.
Abstract

We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera’s field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.

Index Terms:
Novel Deep Learning Methods, Deep Learning for Visual Perception

I Introduction

In modern aging societies, the demand for assistance and support in daily life is increasing; however, there is a feared shortage of home caregivers. As a possible solution, domestic service robots (DSRs) capable of providing physical assistance for caregiving are attracting significant attention. Allowing care recipients to give instructions to DSRs in natural language could greatly increase convenience.

Figure 1: Overview of our method. Our method generates a polygon-based segmentation mask for the target object of a given instruction and image. We introduce the Polygon Matching Loss. The LLM Paraphraser, SBAE, OVA, VCI, and OTVP are explained in Section IV.

However, such instructions sometimes incorporate out-of-vocabulary words, complex referring expressions and redundant phrases. This complexity makes it challenging for DSRs to understand such instructions and identify target objects.

In this study, we focus on the task of generating segmentation masks of the target object given open vocabulary instructions related to object manipulation. This task is important because it is convenient for users if robots can understand and execute object manipulation based on natural language instructions. For instance, given the instruction, “Go to the living room and bring me the pillow that is closest to the potted plant,” the model must generate a segmentation mask for the pillow that is closest to the potted plant. Segmentation masks are more desirable than bounding boxes for object manipulation because object manipulation benefits from accurate predictions of the position and shape of target objects.

Although our target task is closely related to the referring expression segmentation (RES) task[hu2016segmentation], the instructions in our target task often involve two or more sentences. Therefore, it is necessary to identify the object by considering complex relationships between vision and language. Thus, our target task is more challenging than the simple RES task. For instance, consider the instruction “Go downstairs to the open living room with the white fireplace and straighten out the book display next to it.” Referring only to “the book display next to it” is too ambiguous to appropriately identify the target object. In this case, “the white fireplace” indirectly modifies the target object and is important for understanding the instruction.

Although many models [yang2022lavt, wang2021cris, iioka2023mdsm] have been successfully applied to the RES task, most of them do not fully handle multiple sentences. Furthermore, they are unable to handle referring expressions that involve objects outside the camera’s field of view. Recently, some studies have shown that polygon-based mask generation methods can achieve shorter inference times than traditional pixel-based mask generation methods[peng2020deep, liu2021dance, liu2023polyformer, wang2021cris]. However, most of them cannot account for cases in which the order of vertices differs while still representing the same polygon.

In this study, we propose a model that generates a segmentation mask for the target object specified in a natural language instruction. One of the main differences between our method and existing methods is the introduction of the Polygon Matching Loss (PML), which uses optimal transport for vertex matching. Another significant difference is the introduction of Open-Vocabulary 3D Aggregator (OVA), which handles open-vocabulary multimodal features for objects that exist outside the camera’s field of view.

Introducing PML enables the model to handle cases in which the order of vertices differs while still representing the same polygon. As a result, we train the model to predict the appropriate masks regardless of the vertex order, thereby enabling effective training. Additionally, we expect the OVA to enhance the association of open-vocabulary multimodal features with referring expressions that refer to objects that exist outside the camera’s field of view.

A summary of our key contributions is as follows:

  • To train the model efficiently, we introduce the Optimal Transport Vertex Predictor (OTVP), with the PML, which uses optimal transport for vertex matching.

  • We introduce the OVA to obtain open-vocabulary multimodal features for objects that exist outside the camera’s field of view.

  • We introduce the Segment-Based Attentional Enhancer (SBAE), which uses segmentation images to enhance the understanding of object shapes and their spatial relationships.

Figure 2: Proposed method framework. The proposed method consists of five main modules: LLM Paraphraser, SBAE, OVA, Visual Context Interpreter (VCI), and OTVP. $C(\cdot,\cdot)$, SAM, and OpenScene represent the cost function, the Segment Anything Model[Kirillov_2023_ICCV], and OpenScene[Peng2023OpenScene], respectively.

II Related Work

In the field of multimodal language processing, numerous studies have been conducted[UPPAL2022149, chen2023vlp, gu2022vision, zhu2023survey] and multimodal large language models (LLMs) have led to notable successes[gpt4v, dai2023instructblip]. Multimodal LLMs have progressed rapidly, and have been successfully applied to task planning [Driess2023palme, brohan2023can] and the generation of code for action sequences [vemprala2023chatgpt, liang2023code]. These approaches have been widely applied in robotics [miyazawa2023survey, xiao2023robot, kawaharazuka2024real, zeng2023large].

Referring expression comprehension (REC) and RES are two major tasks in which models are required to predict specific regions within images based on referring expressions (e.g., [Yu_2018_CVPR, Luo_2020_CVPR, Ye_2019_CVPR]). REC often requires predicting the rectangular regions of target objects given images and referring expressions[kamath2021mdetr, deng2021transvg, UNINEXT]. However, for object manipulation, segmentation masks are more desirable than bounding boxes because they describe both the position and shape of target objects. Therefore, in this study, we focus on segmentation generation tasks rather than REC tasks.

Most RES models predict the pixel-level masks of target objects[yang2022lavt, wang2021cris, huang2020referring, zou2023segment, GRES], whereas some predict the vertices of a polygon that represents the target object[zhu2022seqtr, liu2023polyformer]. Although these polygon-based mask generation methods are similar to our method, they cannot account for cases in which the order of vertices differs but represents the same polygon. As a result, even when the set of vertices is predicted appropriately, most existing methods do not regard the two polygons as similar, which results in inefficient training of the model. Unlike them, we introduce polygon matching with optimal transport in mask generation to achieve efficient training.

Additionally, several studies have addressed referring expression understanding for DSRs [iioka2023mdsm, kaneda2024learning, korekata2023switch, karamcheti2023voltron]. These tasks involve decomposing high-level instructions into atomic actions and executing them [karamcheti2023voltron, lynch2023interactive, chen2023polarnet], and identifying the target objects specified in the instruction sentences [kaneda2024learning, iioka2023mdsm, homerobot, parashar2023slap]. The authors of [iioka2023mdsm] proposed MDSM, which is a two-stage segmentation model designed to refine masks generated by DDPM [ho2020ddpm]. Unlike MDSM, our method handles information about objects that exist outside the camera’s field of view.

Many datasets for referring expression understanding have been proposed, including RefCOCO[kazemzadeh2014referitgame], RefCOCO+[yu2016modeling], and G-Ref[mao2016generation]. In the field of robotics, datasets that contain natural language instruction sentences [hatori2018interactively, qi2020reverie, shridhar2020alfred] are used for multimodal language understanding tasks. These datasets focus on object manipulation tasks within an indoor environment. The studies [hatori2018interactively, qi2020reverie] are notable because they are based on real-world data. In particular, the instructions included in the REVERIE dataset[qi2020reverie] often consist of multiple sentences, which makes the task of identifying the target object particularly challenging.

III Problem Statement

In this study, we focus on the task that involves generating the segmentation mask of the target object from an image of the indoor environment, 3D point clouds, and an instruction related to object manipulation. We define this task as the Object Segmentation from Manipulation Instructions-3D (OSMI-3D) task. In this task, the model should generate a segmentation mask for the target object indicated in the instruction. Fig. 1 shows a typical input of the OSMI-3D task. The goal is to generate a mask, which is indicated by the red area, given an instruction such as “Go to the living room and bring me the pillow that is closest to the potted plant.”

We define the inputs and output as follows:

  • Inputs: an image, a 3D point cloud, and instruction sentences.

  • Output: a pixel-wise segmentation mask of the target object indicated in the instruction.

In this study, we do not assume cases in which there are multiple target objects or no target object in a single image. We use mean intersection over union (mIoU) and precision as the evaluation metrics.
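The two evaluation metrics can be illustrated with a minimal sketch. It assumes that masks are binary 2D lists of equal size and that "precision" means the fraction of samples whose IoU exceeds a threshold, as is common in RES evaluation; the threshold value and the function names `iou`, `mean_iou`, and `precision_at` are illustrative and not taken from the paper.

```python
def iou(pred, ref):
    """Intersection over union of two binary masks (2D lists of 0/1)."""
    pairs = [(p, r) for pred_row, ref_row in zip(pred, ref)
             for p, r in zip(pred_row, ref_row)]
    inter = sum(p & r for p, r in pairs)
    union = sum(p | r for p, r in pairs)
    # Two empty masks have no defined IoU; returning 0.0 is an assumption.
    return inter / union if union else 0.0


def mean_iou(preds, refs):
    """Average IoU over all (prediction, reference) pairs."""
    scores = [iou(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)


def precision_at(preds, refs, threshold=0.5):
    """Fraction of samples whose IoU is strictly above the threshold."""
    scores = [iou(p, r) for p, r in zip(preds, refs)]
    return sum(s > threshold for s in scores) / len(scores)
```

Because the task assumes exactly one target object per image, the reference mask is never empty, so the empty-union branch should not be reached in practice.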

IV Proposed Method

The proposed method predicts a segmentation mask for the target object referred to in the given object manipulation instructions. Our key contributions are as follows:

  • To train the model efficiently, we introduce the Optimal Transport Vertex Predictor (OTVP), with the PML, which uses optimal transport for vertex matching.

  • We introduce the OVA to obtain the open-vocabulary multimodal features of objects that exist outside the camera’s field of view. It can handle their correspondence with referring expressions.

  • We introduce the SBAE to enhance the understanding of attributes, such as shape and spatial relationships, based on the segmentation images.

Fig. 2 shows the overview of the proposed method. It consists of five main modules: LLM Paraphraser, SBAE, OVA, Visual Context Interpreter (VCI) and OTVP. Our method, particularly the proposed PML, can be widely applied to polygon-based mask generation models[liu2023polyformer, zhu2022seqtr].

The inputs are defined as $\bm{x}=\{\bm{x}_{\mathrm{img}},X_{\mathrm{pcl}},\bm{x}_{\mathrm{inst}}\}$, where $\bm{x}_{\mathrm{img}}\in\mathbb{R}^{3\times H\times W}$, $X_{\mathrm{pcl}}=\{\bm{\xi}_{i}\mid i=0,1,2,\ldots,N_{\mathrm{pcl}}\}$, and $\bm{x}_{\mathrm{inst}}\in\{0,1\}^{v\times l}$ denote an image, a 3D point cloud, and instruction sentences tokenized as one-hot vectors, respectively. Note that $H$, $W$, $\bm{\xi}_{i}$, $N_{\mathrm{pcl}}$, $v$, and $l$ denote the height of the image, the width of the image, the $i$-th point, the total number of points in the point cloud, the vocabulary size, and the maximum token length of the instruction, respectively.

IV-A LLM Paraphraser

Unlike the RES task, the OSMI-3D task often involves two or more sentences. As shown later, typical RES models do not handle such cases and may focus on phrases unrelated to the target object. To improve the understanding of phrases associated with the target object, we introduce the LLM Paraphraser. Specifically, the LLM Paraphraser combines several sentences and summarizes referring expressions related to the target object into a single sentence.

The LLM Paraphraser takes $\bm{x}_{\mathrm{inst}}$ as input and summarizes it using an LLM (GPT-3.5-turbo[gpt35turbo]). For example, when $\bm{x}_{\mathrm{inst}}$ is “Go to the dining table. Then pick up the candle on the right,” the sentence “Pick up the right candle on the dining table.” is obtained. We embed the sentence into the language features $\bm{h}_{\mathrm{llp}}\in\mathbb{R}^{d_{\mathrm{llp}}}$ using text-embedding-ada-002[adatxtembeddingada002], where $d_{\mathrm{llp}}$ is the number of feature dimensions. The output of the LLM Paraphraser is $\bm{h}_{\mathrm{llp}}$.

IV-B Visual Context Interpreter

Previous RES studies can be mainly divided into two approaches for extracting image features: using image encoders (e.g., ResNet[He2015DeepRLresnet] and ViT[dosovitskiy2020image]) to extract visual features such as texture and edges; and using multimodal image encoders (e.g., CLIP[radford2021learning], UNITER [chen2020uniter], and BLIP[li2022blip]) to extract multimodal image features that are aligned with natural language. However, these features sometimes lack visual representations related to complex referring expressions (e.g., “the second chair from the left in the first floor dining room that has the mirror hanging above the fireplace”) and spatial relationships (e.g., “the hand towel on the towel rack to the left of the sink”).

To handle such complex visual representations, we introduce the VCI. In the VCI, multimodal LLMs generate descriptions that include details such as the attributes of objects, their spatial relationships, and their complex relationships in referring expressions. Furthermore, using multimodal LLMs, we can obtain additional common-sense knowledge that is not contained in the image alone. For example, if the scene shows an open door and the outside is visible through the door, it is highly likely to be an entrance.

The inputs of the VCI are $\bm{x}_{\mathrm{img}}$ and $\bm{x}_{\mathrm{inst}}$, and the output is the intermediate feature $\bm{h}_{\mathrm{vci}}\in\mathbb{R}^{d_{\mathrm{vci}}}$, where $d_{\mathrm{vci}}$ represents the dimension. First, we obtain a description of $\bm{x}_{\mathrm{img}}$ using a multimodal LLM (gpt-4-vision-preview [gpt4v]). We then embed the description into $\bm{h}_{\mathrm{vci}}$ using text-embedding-ada-002.

IV-C Open-Vocabulary 3D Aggregator

Existing methods [iioka2023mdsm, yang2022lavt, zhu2022seqtr] often fail to identify the target object given the referring expressions of objects outside the field of view. To address this issue, we introduce the OVA to enhance the understanding of referring expressions for objects outside the field of view. This module aligns 3D point clouds with open-vocabulary multimodal features and links them to referring expressions. As a result, it is expected to obtain information about objects outside the camera’s field of view without the need to capture images from various angles.

In this module, the input is $X_{\mathrm{pcl}}$ and the output is the intermediate feature $\bm{h}_{\mathrm{ova}}\in\mathbb{R}^{d_{\mathrm{ova}}}$, where $d_{\mathrm{ova}}$ denotes the dimensionality. First, from $X_{\mathrm{pcl}}$, we extract the $N_{\mathrm{near}}$ points closest to the position where $\bm{x}_{\mathrm{img}}$ was captured. This subset is denoted by $X_{\mathrm{near}}$. We use only $N_{\mathrm{near}}$ points because referring expressions often refer to objects around the target object, and using points at remote locations is not efficient. We obtain multimodal features $\bm{h}'_{\mathrm{ova}}=f(X_{\mathrm{near}})$. Note that $f(\cdot)$ denotes the application of pre-trained OpenScene[Peng2023OpenScene]. OpenScene embeds multimodal features into each point of a 3D point cloud using CLIP[radford2021learning]. Finally, we obtain the feature $\bm{h}_{\mathrm{ova}}=\mathrm{MaxPool}(\mathrm{Upsample}(\bm{h}'_{\mathrm{ova}}))$, where $\mathrm{MaxPool}(\cdot)$ and $\mathrm{Upsample}(\cdot)$ denote max pooling and upsampling, respectively.
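The point selection and pooling steps can be sketched as follows. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for the pre-trained OpenScene model $f(\cdot)$, the camera position is assumed to be given as a 3D coordinate, and the upsampling step is omitted because its details are not specified here.

```python
import math


def nearest_points(points, camera_pos, n_near):
    """Return the n_near points closest (Euclidean distance) to camera_pos."""
    return sorted(points, key=lambda p: math.dist(p, camera_pos))[:n_near]


def max_pool_features(features):
    """Element-wise maximum over per-point feature vectors."""
    return [max(column) for column in zip(*features)]


def toy_encoder(point):
    """Hypothetical stand-in for OpenScene: maps a point to a feature vector."""
    return [point[0], point[1], point[2], sum(point)]


def ova_feature(points, camera_pos, n_near, encoder=toy_encoder):
    """Select nearby points, encode each one, and pool into a single vector."""
    near = nearest_points(points, camera_pos, n_near)
    return max_pool_features([encoder(p) for p in near])
```

Max pooling makes the resulting feature independent of the order in which the points are stored, which suits an unordered point cloud.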

IV-D Segment-Based Attentional Enhancer

Figure 3: Structure of the SBAE. This enhances the understanding of object segment information, and fuses visual and linguistic features. CA, SAM and FFN represent cross-attention, the Segment Anything Model[Kirillov_2023_ICCV] and a feed-forward network, respectively.

Existing RES and OSMI models sometimes inappropriately predict the contours of objects. To address this, we introduce the SBAE to enhance the understanding of segment information related to target objects.

Fig. 3 shows the structure of the SBAE. It extracts image features at multiple resolutions from $\bm{x}_{\mathrm{img}}$ and a segmentation image, and then integrates these features with $\bm{h}_{\mathrm{llp}}$, $\bm{h}_{\mathrm{vci}}$, and $\bm{h}_{\mathrm{ova}}$ from each module. The inputs to this module are $\bm{x}_{\mathrm{img}}$, $\bm{h}_{\mathrm{llp}}$, $\bm{h}_{\mathrm{vci}}$, and $\bm{h}_{\mathrm{ova}}$. First, we use the pre-trained SAM[Kirillov_2023_ICCV] to obtain a segmentation image $\bm{s}\in\mathbb{R}^{3\times H\times W}$ given $\bm{x}_{\mathrm{img}}$. As shown in the figure, we obtain image features $\{\bm{V}_{k}\in\mathbb{R}^{H_{k}\times W_{k}\times C_{k}}\}_{k=1}^{M_{v}}$ from intermediate layers at $M_{v}$ different resolutions using DarkNet-53[Wang_2021_CVPR] pre-trained on the MS-COCO[lin2014microsoft] dataset.

Similarly, we obtain image features $\{\bm{V}'_{l}\in\mathbb{R}^{H_{l}\times W_{l}\times C_{l}}\}_{l=1}^{M_{s}}$ from $M_{s}$ types of intermediate layers from $\bm{s}$. Note that $H_{i}$, $W_{i}$, and $C_{i}$ denote the height, width, and number of channels of the corresponding feature map, where $i=k$ for $\bm{V}_{k}$ and $i=l$ for $\bm{V}'_{l}$.

As shown in Fig. 3, we downsample each obtained $\bm{V}_{k}$ and $\bm{V}'_{l}$, and then obtain $\bm{V}_{\mathrm{mix}}$ by concatenating them in the channel dimension. Furthermore, we obtain $\bm{h}_{\mathrm{mix}}$ by concatenating $\bm{h}_{\mathrm{llp}}$, $\bm{h}_{\mathrm{vci}}$, and $\bm{h}_{\mathrm{ova}}$ in the channel dimension, and $\bm{h}_{\mathrm{mix}}$ is downsampled. Finally, we compute the cross-attention between $\bm{V}_{\mathrm{mix}}$ and $\bm{h}_{\mathrm{mix}}$ to obtain the intermediate feature $\bm{S}_{a}=f_{a}(\bm{V}_{\mathrm{mix}},\bm{h}_{\mathrm{mix}})$. Note that $f_{a}(\cdot,\cdot)$ denotes the cross-attention function. Furthermore, $f_{a}(\cdot,\cdot)$ for any matrices $\bm{X}_{A}$ and $\bm{X}_{B}$ is defined as follows:

$$f_{a}(\bm{X}_{A},\bm{X}_{B})=\mathrm{softmax}\left(\frac{(\bm{W}_{q}\bm{X}_{A})(\bm{W}_{k}\bm{X}_{B})^{\top}}{\sqrt{d}}\right)(\bm{W}_{v}\bm{X}_{B}),$$

where $\bm{W}_{q}$, $\bm{W}_{k}$, and $\bm{W}_{v}$ are trainable weights, and $d$ is a scaling factor.
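The cross-attention function can be illustrated with a small pure-Python sketch. Two choices here are assumptions rather than details from the paper: tokens are stored as rows, so each projection is applied token-wise as a right multiplication, and the scaling factor $d$ is taken to be the dimension of the projected queries. The helper names are illustrative.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(x_a, x_b, w_q, w_k, w_v):
    """Queries come from x_a; keys and values come from x_b."""
    q = matmul(x_a, w_q)  # shape (n_a, d)
    k = matmul(x_b, w_k)  # shape (n_b, d)
    v = matmul(x_b, w_v)  # shape (n_b, d_v)
    d = len(q[0])         # assumed scaling factor: query/key dimension
    scores = [[s / math.sqrt(d) for s in row]
              for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)
```

Each output row is a convex combination of the value rows, so the visual tokens in $\bm{V}_{\mathrm{mix}}$ gather information from the language and point-cloud features in $\bm{h}_{\mathrm{mix}}$ in proportion to their similarity.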

IV-E Optimal Transport Vertex Predictor

The OTVP takes $\bm{S}_{a}$ as input. The output is $\hat{Y}=\{\hat{\bm{y}}_{i}\mid i=1,2,\ldots,N'\}$, where $\hat{\bm{y}}_{i}$ denotes the coordinates of a vertex. The OTVP consists of a transformer encoder and a transformer decoder, which consist of $n_{\mathrm{enc}}$ and $n_{\mathrm{dec}}$ transformer layers, respectively. By inputting $\bm{S}_{a}$ into this encoder-decoder and applying a linear projection, we obtain $\hat{Y}$.

Existing methods[zhu2022seqtr, liu2023polyformer, liu2021dance, peng2020deep] often fail to account for cases in which the order of vertices differs but represents the same polygon. The bottom-right diagram and bottom-left diagram in Fig. 1 show an example. The green and red shapes in the figure represent the predicted and correct masks, respectively. In the bottom-left diagram in Fig. 1, the two sets of vertices represent polygons of the same shape. However, the order of vertices is different. Despite appropriately predicting the set of vertices, most existing methods do not consider the polygons to be similar, which results in a significant loss. This can lead to inefficient training of the model.
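The vertex-order problem can be made concrete with a small sketch. It uses an index-wise L1 distance as a stand-in for an order-dependent regression loss; this choice and the function names are assumptions for illustration, not the loss of any specific cited method.

```python
def shoelace_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def indexwise_l1(pred, ref):
    """Order-dependent loss: compares the i-th predicted and reference vertex."""
    return sum(abs(px - rx) + abs(py - ry)
               for (px, py), (rx, ry) in zip(pred, ref))


# A unit square and the same square with its vertex list cyclically shifted.
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
prediction = reference[1:] + reference[:1]
```

Both vertex lists describe the same polygon, so `shoelace_area` returns the same value for each, yet `indexwise_l1(prediction, reference)` is nonzero because every vertex is compared with its neighbor rather than with itself.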

We introduce the PML $\mathcal{L}_{OT}$ to effectively address such cases using optimal transport. As shown in the bottom-right diagram in Fig. 1, it involves matching between $\hat{Y}$ and the set of vertices of the reference mask $Y=\{\bm{y}_{j}\mid j=1,2,\ldots,N\}$ using optimal transport, where $N$ represents the number of vertices in the reference mask.

When $\hat{Y}$ and $Y$ are given, we can identify a transportation plan that transfers $\hat{Y}$ to $Y$ at the minimum transportation cost. We regard $\hat{Y}$ and $Y$ as two discrete distributions $\alpha=\sum_{i=1}^{N'}\bm{a}_{i}\delta_{\hat{\bm{y}}_{i}}$ and $\beta=\sum_{j=1}^{N}\bm{b}_{j}\delta_{\bm{y}_{j}}$, respectively. Note that $\delta_{\hat{\bm{y}}_{i}}$ and $\delta_{\bm{y}_{j}}$ represent the Dirac delta function centered on $\hat{\bm{y}}_{i}$ and $\bm{y}_{j}$, respectively. The weight vectors $\bm{a}$ and $\bm{b}$ are normalized. Using this, we define $\mathcal{L}_{OT}$ as follows:

$$\mathcal{L}_{OT}=\min_{\bm{P}\in\mathcal{U}(\bm{a},\bm{b})}\sum_{i=1}^{N^{\prime}}\sum_{j=1}^{N}C(\hat{\bm{y}}_{i},\bm{y}_{j})P_{ij},\tag{1}$$

$$\mathcal{U}(\bm{a},\bm{b})=\left\{\bm{P}\in\mathbb{R}^{N^{\prime}\times N}\mid P_{ij}\geq 0,\ \bm{P}\mathbf{1}_{N}=\bm{a},\ \bm{P}^{\top}\mathbf{1}_{N^{\prime}}=\bm{b}\right\},$$

where $\mathbf{1}_{N^{\prime}}$ and $\mathbf{1}_{N}$ denote vectors of dimension $N^{\prime}$ and $N$, respectively, with all components equal to 1. Furthermore, $C(\hat{\bm{y}}_{i},\bm{y}_{j})=\|\hat{\bm{y}}_{i}-\bm{y}_{j}\|_{2}$ and $P_{ij}$ denote the transportation cost and transportation plan from $\hat{\bm{y}}_{i}$ to $\bm{y}_{j}$, respectively, where $\|\cdot\|_{2}$ denotes the $L^{2}$ norm. In this study, we approximate Equation (1) efficiently by adding entropy regularization and solving the regularized problem with the Sinkhorn algorithm [cuturi2013sinkhorn].
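To make the computation concrete, the following is a minimal sketch of an entropy-regularized Sinkhorn solver for Equation (1); it is not the authors' implementation. It assumes uniform weights $1/N^{\prime}$ and $1/N$, vertex coordinates normalized to $[0, 1]$, and illustrative values for the regularization strength `eps` and the iteration count `n_iters`. The function name `sinkhorn_ot_loss` is hypothetical. With a much smaller `eps` or unnormalized pixel coordinates, the kernel can underflow and a log-domain variant would be needed.

```python
import math

def sinkhorn_ot_loss(pred, ref, eps=0.05, n_iters=200):
    """Approximate Eq. (1) with entropy regularization and Sinkhorn iterations.

    pred: list of N' predicted (x, y) vertices; ref: list of N reference vertices.
    eps and n_iters are illustrative assumptions, not the paper's settings.
    Returns (loss, plan), where plan[i][j] approximates P_ij.
    """
    n_pred, n_ref = len(pred), len(ref)
    a = [1.0 / n_pred] * n_pred  # uniform, normalized weights (assumed)
    b = [1.0 / n_ref] * n_ref
    # cost[i][j] = ||pred_i - ref_j||_2
    cost = [[math.dist(p, r) for r in ref] for p in pred]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n_pred
    v = [1.0] * n_ref
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(n_ref))
             for i in range(n_pred)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n_pred))
             for j in range(n_ref)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(n_ref)]
            for i in range(n_pred)]
    loss = sum(plan[i][j] * cost[i][j]
               for i in range(n_pred) for j in range(n_ref))
    return loss, plan

# Same square, but the vertex list starts at a different corner.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rotated = square[2:] + square[:2]
loss, plan = sinkhorn_ot_loss(rotated, square)
```

For the rotated square, the transport cost is close to zero because each predicted vertex coincides with some reference vertex, whereas an index-by-index comparison would report a distance of $\sqrt{2}$ for every vertex.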

Figure 4 panels. Rows: (i), (ii), (iii). Columns: (a) $\bm{x}_{\mathrm{img}}$, (b) GT, (c) LAVT [yang2022lavt], (d) SeqTR [zhu2022seqtr], (e) MDSM [iioka2023mdsm], (f) Ours.

Figure 4: Qualitative results of successful and failure cases. (i) and (ii) show successful examples, and (iii) shows a failure example. The instructions for (i), (ii) and (iii) were as follows: “In the 3rd level bathroom, there is a box of tissues to the left of the basin. Please fetch them”; “Walk to the living room and fetch the leftmost pillow on the smaller white sofa, closest to the plant on the small table.” and “Go to the closet in the bedroom with the orange comforter and bring me the second hanger from the top.”

TABLE I: Comparison with baseline methods on the SHIMRIE-3D dataset. The bold numbers represent the highest values for each metric.

| Model | mIoU [%] ↑ | P@0.5 [%] ↑ | P@0.7 [%] ↑ |
|---|---|---|---|
| LAVT [yang2022lavt] | 28.16 ± 2.85 | 26.46 ± 4.01 | 18.75 ± 3.29 |
| SeqTR [zhu2022seqtr] | 21.84 ± 2.28 | 17.87 ± 7.00 | 5.16 ± 5.26 |
| MDSM [iioka2023mdsm] | 24.36 ± 3.87 | 22.49 ± 5.46 | 13.71 ± 3.34 |
| Ours | **38.16 ± 2.46** | **48.85 ± 2.70** | **22.29 ± 3.32** |


TABLE II: Quantitative results of the ablation studies. The bold numbers represent the highest values for each metric. The columns labeled VCI, OVA, SBAE, and PML indicate whether each module is included, as indicated by a check mark.

| Model | VCI | OVA | SBAE | PML | mIoU [%] ↑ | P@0.5 [%] ↑ | P@0.7 [%] ↑ |
|---|---|---|---|---|---|---|---|
| (i) | | ✓ | ✓ | ✓ | 35.27 ± 5.41 | 45.31 ± 7.64 | 19.48 ± 4.99 |
| (ii) | ✓ | | ✓ | ✓ | 37.36 ± 2.55 | 48.11 ± 4.13 | **27.24 ± 4.99** |
| (iii) | ✓ | ✓ | | ✓ | 31.77 ± 0.92 | 37.86 ± 2.06 | 14.00 ± 4.28 |
| (iv) | ✓ | ✓ | ✓ | | 33.07 ± 3.44 | 41.04 ± 6.74 | 20.42 ± 8.18 |
| (v) | ✓ | ✓ | ✓ | ✓ | **38.16 ± 2.46** | **48.85 ± 2.70** | 22.29 ± 3.32 |

V Experimental Results

V-A Quantitative Results

Table I shows the quantitative results for the baseline methods and the proposed method. We ran each experiment five times and report the averages and standard deviations of mIoU and P@$k$ ($k = 0.5, 0.7$). The bold numbers in Table I represent the highest values for each metric.

As the baseline methods, we used MDSM[iioka2023mdsm], LAVT[yang2022lavt], and SeqTR[zhu2022seqtr]. We chose them as baseline methods for the following reasons. We used MDSM because it has been successfully applied to the OSMI task. Furthermore, we used LAVT and SeqTR because these have been successfully applied to the RES task, which is closely related to the OSMI-3D task.

We used mIoU and Precision at $k$ (P@$k$) as the evaluation metrics because they are standard metrics in RES tasks, which are closely related to the OSMI-3D task. We chose mIoU as the primary metric. mIoU is defined as $\mathrm{mIoU}=(1/N_{s})\sum_{i=1}^{N_{s}}|Y_{i}\cap\hat{Y}_{i}|/|Y_{i}\cup\hat{Y}_{i}|$, where $N_{s}$ denotes the number of samples, and $\hat{Y}_{i}$ and $Y_{i}$ denote the sets of pixels corresponding to the predicted mask and GT mask, respectively, in the $i$-th sample. P@$k$ is defined as $\mathrm{P}@k=T_{k}/N_{s}$, where $T_{k}$ denotes the number of samples for which the IoU between the predicted mask and the GT mask exceeded the threshold $k$.
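The two metrics can be sketched directly from these definitions. The example below is an illustrative sketch, not the evaluation code used in the experiments: masks are represented as Python sets of (row, column) pixel coordinates, the helper names `iou` and `miou_and_precision` are hypothetical, and scoring two empty masks as 0 is an assumed convention.

```python
def iou(pred_pixels, gt_pixels):
    """IoU between two masks given as sets of (row, col) pixel coordinates."""
    union = pred_pixels | gt_pixels
    if not union:
        return 0.0  # assumed convention when both masks are empty
    return len(pred_pixels & gt_pixels) / len(union)

def miou_and_precision(preds, gts, thresholds=(0.5, 0.7)):
    """Return mIoU and a dict mapping each threshold k to P@k."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    n_s = len(ious)
    miou = sum(ious) / n_s
    # T_k counts samples whose IoU strictly exceeds the threshold k.
    p_at_k = {k: sum(value > k for value in ious) / n_s for k in thresholds}
    return miou, p_at_k

# Two toy samples (hypothetical data): IoU = 3/4 and IoU = 1/3.
gts = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(0, 0), (0, 1)}]
preds = [{(0, 0), (0, 1), (1, 0)}, {(0, 1), (0, 2)}]
miou, p_at_k = miou_and_precision(preds, gts)
```

In this toy case only the first sample exceeds either threshold, so both P@0.5 and P@0.7 equal 0.5, while mIoU is the mean of 3/4 and 1/3.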

Table I shows that the proposed method achieved an mIoU of 38.16%, whereas LAVT, SeqTR, and MDSM achieved 28.16%, 21.84%, and 24.36%, respectively. The proposed method thus outperformed LAVT, the best-performing baseline, by 10.00 points. Table I also shows that P@0.5 for LAVT, SeqTR, MDSM, and the proposed method was 26.46%, 17.87%, 22.49%, and 48.85%, respectively. The proposed method therefore outperformed LAVT, again the best-performing baseline, by 22.39 points in terms of P@0.5. Similarly, the proposed method outperformed the baseline methods in terms of P@0.7.

V-B Qualitative Results

Fig. 4 shows the qualitative results. In the figure, columns (a) and (b) represent $\bm{x}_{\mathrm{img}}$ and GT, respectively, and columns (c), (d), (e), and (f) represent the masks predicted by LAVT, SeqTR, MDSM, and the proposed method, respectively. Fig. 4 (i) and (ii) show successful examples.

Fig. 4 (i) shows an example in which the instruction was “In the 3rd level bathroom, there is a box of tissues to the left of the basin. Please fetch them here.” In this example, neither LAVT nor MDSM generated any masks, whereas SeqTR incorrectly masked the magazine. By contrast, the proposed method appropriately generated a mask for the tissue box, which demonstrates that it successfully identified the target object specified in the instruction. In the example in Fig. 4 (ii), the instruction was “Walk to the living room and fetch me the leftmost pillow on the smaller white sofa, the pillow closest to the plant on the small table.” In this case, LAVT and MDSM again generated no masks for the object, and SeqTR generated a mask for a different object on the table. By contrast, the proposed method appropriately generated a mask for the beige cushion. This suggests that the proposed method is capable of understanding referring expressions related to color and spatial relationships.

Fig. 4 (iii) illustrates a failure example, in which the instruction was “Go to the closet in the bedroom with the orange comforter and bring me the second hanger on top.” In this example, both LAVT and MDSM generated an under-segmented mask for two hangers, and SeqTR masked unrelated areas. Our method masked the hanger on the right-hand side, whereas the target object specified in the instruction was the left hanger. The phrase “second hanger” in the instruction was ambiguous, which made it difficult to select a single target object.

Figure 5 panels: (a) GT, (b) Model (ii), (c) Model (v).

Figure 5: The instruction sentence for this example was “Go to the bathroom on level 1 and bring me the picture furthest to the left.” In this case, the mask generated by Model (v) was slightly skewed toward the sink.

V-C Ablation Studies

We set the following four conditions for the ablation studies:

VCI ablation: We removed the VCI to assess its contribution. Table II shows that the mIoU for Model (i) was 35.27%, which was 2.89 points lower than that for Model (v). P@0.5 and P@0.7 for Model (i) were also lower than those for Model (v). These results show that the VCI contributed to the improvement in performance, which indicates that it enhanced the understanding of referring expressions.

OVA ablation: We investigated the contribution of the OVA by removing it. Table II indicates that the mIoU for Model (ii) was 37.36%, which was 0.80 points lower than that for Model (v). P@0.5 also decreased, whereas P@0.7 increased by 4.95 points. The decreases in mIoU and P@0.5 may indicate that the OVA provided information about objects outside the camera’s field of view. However, using the OVA to obtain information around objects or the environment may inadvertently lead to focusing on nouns other than the target object. For example, Fig. 5 shows a case in which the instruction was “Go to the bathroom on level 1 and bring me the picture furthest to the left” and the mask generated by Model (v) was slightly skewed toward the sink. This is likely to have occurred because the model excessively focused on the word ‘bathroom’ in the instruction, influenced by the feature of a sink related to ‘bathroom,’ which was obtained through the OVA. As a result, it is possible that the model could not focus strongly on the word ‘picture,’ which was the target object.

SBAE ablation: To investigate the effectiveness of the SAM module in the SBAE, we removed it. Table II shows that Model (iii) achieved an mIoU of 31.77%, which was 6.39 points lower than that of Model (v). Model (iii) also scored lower in terms of P@0.5 and P@0.7. This suggests that the SBAE enhanced the understanding of segment information about objects, thereby enabling the more appropriate prediction of object contours.

PML ablation: We investigated the contribution of the PML to performance. Table II shows that the mIoU for Model (iv) was 33.07%, which was 5.09 points lower than that for Model (v). P@0.5 and P@0.7 were also lower. This indicates that the PML made training more effective.
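As a quick arithmetic check on the ablation results, the sketch below recomputes the mIoU drop of each ablated model relative to Model (v) from the means in Table II. The dictionary layout and variable names are illustrative only.

```python
# Mean mIoU [%] from Table II; keys are the model labels used in the text.
miou = {"(i)": 35.27, "(ii)": 37.36, "(iii)": 31.77, "(iv)": 33.07, "(v)": 38.16}
# Module removed in each ablated model, as described in the text.
removed = {"(i)": "VCI", "(ii)": "OVA", "(iii)": "SBAE", "(iv)": "PML"}

full = miou["(v)"]
# Drop in mIoU (percentage points) when each module is removed.
drops = {removed[m]: round(full - miou[m], 2) for m in removed}
# Rank modules by how much mIoU falls when each one is removed.
ranking = sorted(drops, key=drops.get, reverse=True)
print(drops)
print(ranking)
```

Under this measure, removing the SBAE causes the largest mIoU drop and removing the OVA the smallest. The standard deviations reported in Table II are not taken into account here.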

V-D Error Analysis

In this study, we defined failed cases as those with an IoU lower than 0.5. Based on this definition, the proposed method failed on 179 test samples. We analyzed 100 samples with the lowest IoU values out of 179 samples of failure cases. Table III describes the categories of the failure cases.
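The selection procedure described above can be sketched as follows. This is a hypothetical helper, not the authors' analysis script: the function name `select_worst_failures` and the toy IoU values are assumptions made for illustration.

```python
def select_worst_failures(ious, threshold=0.5, top_n=100):
    """Return (sample_index, IoU) pairs for the top_n lowest-IoU failures.

    A sample counts as a failure when its IoU is lower than the threshold.
    """
    failures = [(idx, value) for idx, value in enumerate(ious) if value < threshold]
    failures.sort(key=lambda pair: pair[1])  # lowest IoU first
    return failures[:top_n]

# Toy per-sample IoU values (hypothetical, not from the paper).
toy_ious = [0.82, 0.10, 0.49, 0.50, 0.00, 0.73]
worst = select_worst_failures(toy_ious, top_n=2)
```

Note that a sample with an IoU of exactly 0.5 is not counted as a failure under this definition.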

We roughly divided the cases into five types:

  • (a) Serious comprehension error
    This category includes failure cases in which our model incorrectly segmented a large part of objects that were not mentioned in the instruction. For example, our model incorrectly segmented ‘wall’ given the instruction “Clean the decoration on the table.” This is presumably because our model failed to align the image and language.

  • (b) Reference/exophora resolution error
    This category represents cases in which our model incorrectly segmented objects in the same category that were different from the target object. For instance, our model improperly segmented “the picture on the right-hand side” following the instruction “Bring the leftmost picture on the wall.” This is presumably because of our model’s failure to understand the referring expressions appropriately.

  • (c) Segmentation of non-target objects
    This category refers to cases in which our model segmented objects that appeared in the instruction but were not the target object. For example, ‘bed’ was segmented given the instruction “Fetch me a pillow on the bed.”

  • (d) Hallucination in VCI
    The cases in this category involve the multimodal LLM in the VCI inappropriately describing the appearance and position of objects, or describing non-existent objects. For example, when there was no cushion in the room, the multimodal LLM generated the sentence “The cushion on the left is white.”

  • (e) Ambiguous instruction
    This category refers to cases in which the instructions included ambiguous expressions about the name or location of the target object, which made it difficult to identify the target object. Suppose that the instruction “Please bring the second hanger” was provided and the image contained multiple hangers. It would remain unclear which hanger was being referred to.

TABLE III: Error analysis on failure cases.

| Errors | # of Errors |
|---|---|
| Serious comprehension error | 43 |
| Reference/exophora resolution error | 32 |
| Segmentation of non-target objects | 13 |
| Hallucination in VCI | 10 |
| Ambiguous instruction | 2 |
| Total | 100 |

Table III indicates that the main bottlenecks were (a) and (b). We attribute the former to the model’s failure to ground referring expressions to their corresponding target objects. As a solution, we may be able to use SEEM [zou2023segment], which performs open vocabulary panoptic segmentation by aligning textual features and visual prompt features in a joint visual-semantic space. Using this semantic labeling approach, irrelevant stuff or things could be masked out so that the model does not focus on them. For the latter, the model often misunderstood the spatial relationships between a target object and its surroundings. Additionally, there were some cases in which the VCI failed to clearly describe the positional relationships between objects. Therefore, a solution may be to improve the prompt so that it focuses on spatial relationships.

VI Conclusions

In this study, we focused on the OSMI-3D task, in which a model generates a segmentation mask of the target object given an image of the indoor environment, 3D point clouds, and an instruction sentence related to object manipulation. Our method outperformed the baseline methods on all standard metrics in the OSMI-3D task. In future research, we plan to implement a semantic labeling approach to mask out irrelevant stuff and things, so that the model focuses on the relevant objects.

ACKNOWLEDGMENT

This work was partially supported by JSPS KAKENHI Grant Number 20H04269, JST Moonshot, JST CREST, and NEDO.