Alleviating the Burden of Labeling:
Sentence Generation by Attention Branch Encoder–Decoder Network

Tadashi Ogura¹, Aly Magassouba¹, Komei Sugiura^1,2, Tsubasa Hirakawa³, Takayoshi Yamashita³,
Hironobu Fujiyoshi³, and Hisashi Kawai¹ Manuscript received: February 22, 2020; Revised June 2, 2020; Accepted July 7, 2020. This paper was recommended for publication by Editor Dongheui Lee upon evaluation of the Associate Editor and Reviewers’ comments.¹Authors are with the National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika, Soraku, Kyoto 619-0289, Japan. [email protected]²Author is with Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Kanagawa 223-8522, Japan. [email protected]³Authors are with Chubu University, 1200 Matsumotocho, Kasugai, Aichi 487-8501, Japan. {[email protected], takayoshi@isc, fujiyoshi@isc}.chubu.ac.jpDigital Object Identifier (DOI): see top of this page.

Abstract

We consider the task of generating segmentation masks for the target object from an object manipulation instruction, which allows users to give open vocabulary instructions to domestic service robots. Conventional segmentation generation approaches often fail to account for objects outside the camera’s field of view and cases in which the order of vertices differs but still represents the same polygon, which leads to erroneous mask generation. In this study, we propose a novel method that generates segmentation masks from open vocabulary instructions. We implement a novel loss function using optimal transport to prevent significant loss where the order of vertices differs but still represents the same polygon. To evaluate our approach, we constructed a new dataset based on the REVERIE dataset and Matterport3D dataset. The results demonstrated the effectiveness of the proposed method compared with existing mask generation methods. Remarkably, our best model achieved a +16.32% improvement on the dataset compared with a representative polygon-based method.

Index Terms:

Novel Deep Learning Methods, Deep Learning for Visual Perception

I Introduction

In modern aging societies, the demand for assistance and support in daily life is increasing; however, there is a feared shortage of home caregivers. As a possible solution, domestic service robots (DSRs) capable of providing physical assistance for caregiving are attracting significant attention. Allowing care recipients to give instructions to DSRs in natural language could greatly increase convenience.

Refer to caption — Figure 1: Overview of our method. Our method generates a polygon-based segmentation mask for the target object of a given instruction and image. We introduce the Polygon Matching Loss. The LLM Paraphraser, SBAE, OVA, VCI, and OTVP are explained in Section IV.

However, such instructions sometimes incorporate out-of-vocabulary words, complex referring expressions and redundant phrases. This complexity makes it challenging for DSRs to understand such instructions and identify target objects.

In this study, we focus on the task of generating segmentation masks of the target object given open vocabulary instructions related to object manipulation. This task is important because it is convenient for users if robots can understand and execute object manipulation based on natural language instructions. For instance, given the instruction, “Go to the living room and bring me the pillow that is closest to the potted plant,” it is required to generate a segmentation mask for the pillow that is closest to the potted plant. Segmentation masks are more desirable than bounding boxes for object manipulation because it is desirable to accurately predict the position and shape of target objects.

Although our target task is closely related to the referring expression segmentation (RES) task[hu2016segmentation], the instructions in our target task often involve two or more sentences. Therefore, it is necessary to identify the object by considering complex relationships between vision and language. Thus, our target task is more challenging than the simple RES task. For instance, consider the instruction “Go downstairs to the open living room with the white fireplace and straighten out the book display next to it.” Referring only to “the book display next to it” is too ambiguous to appropriately identify the target object. In this case, “the white fireplace” indirectly modifies the target object and is important for understanding the instruction.

Although many models [yang2022lavt, wang2021cris, iioka2023mdsm] have been successfully applied to the RES task, most of them do not fully handle multiple sentences. Furthermore, they are also unable to handle the referring expressions of objects that exist outside the camera’s view. Recently, some studies show that proposed polygon-based mask generation methods can achieve shorter inference times compared with traditional pixel-based mask generation methods[peng2020deep, liu2021dance, liu2023polyformer, wang2021cris]. However, most of them cannot account for cases in which the order of vertices differs while still representing the same polygon.

In this study, we propose a model that generates a segmentation mask for the target object specified in a natural language instruction. One of the main differences between our method and existing methods is the introduction of the Polygon Matching Loss (PML), which uses optimal transport for vertex matching. Another significant difference is the introduction of Open-Vocabulary 3D Aggregator (OVA), which handles open-vocabulary multimodal features for objects that exist outside the camera’s field of view.

Introducing PML enables the model to handle cases in which the order of vertices differs while still representing the same polygon. As a result, we train the model to predict the appropriate masks regardless of the vertex order, thereby enabling effective training. Additionally, we expect the OVA to enhance the association of open-vocabulary multimodal features with referring expressions that refer to objects that exist outside the camera’s field of view.

A summary of our key contributions is as follows:

•

To train the model efficiently, we introduce the Optimal Transport Vertex Predictor (OTVP), with the PML, which uses optimal transport for vertex matching.
•

We introduce the OVA to obtain open-vocabulary multimodal features for objects that exist outside the camera’s field of view.
•

We introduce the Segment-Based Attentional Enhancer (SBAE), which uses segmentation images to enhance the understanding of object shapes and their spatial relationships.

II Related Work

In the field of multimodal language processing, numerous studies have been conducted[UPPAL2022149, chen2023vlp, gu2022vision, zhu2023survey] and multimodal large language models (LLMs) have led to notable successes[gpt4v, dai2023instructblip]. Multimodal LLMs have progressed rapidly, and have been successfully applied to task planning [Driess2023palme, brohan2023can] and the generation of code for action sequences [vemprala2023chatgpt, liang2023code]. These approaches have been widely applied in robotics [miyazawa2023survey, xiao2023robot, kawaharazuka2024real, zeng2023large].

Referring expression comprehension (REC) and RES are two major tasks in which models are required to predict specific regions within images based on referring expressions (e.g., [Yu_2018_CVPR, Luo_2020_CVPR, Ye_2019_CVPR]). REC often requires predicting the rectangular regions of target objects given images and referring expressions[kamath2021mdetr, deng2021transvg, UNINEXT]. Therefore, in this study, we focus on segmentation generation tasks rather than REC tasks.

Most RES models predict the pixel-level masks of target objects[yang2022lavt, wang2021cris, huang2020referring, zou2023segment, GRES], whereas some predict the vertices of a polygon that represents the target object[zhu2022seqtr, liu2023polyformer]. Although these polygon-based mask generation methods are similar to our method, they cannot account for cases in which the order of vertices differs but represents the same polygon. Unlike them, we introduce polygon matching with optimal transport in mask generation to achieve efficient training. As a result, despite appropriately predicting the set of vertices, most existing methods do not consider polygons that are similar, which results in the inefficient training of the model.

Additionally, several studies have been conducted aimed at referring expression understanding for DSRs [iioka2023mdsm, kaneda2024learning, korekata2023switch, karamcheti2023voltron]. These tasks involve decomposing high-level instructions into atomic actions and executing them [karamcheti2023voltron, lynch2023interactive, chen2023polarnet], and identifying the target objects specified in the instruction sentences [kaneda2024learning, iioka2023mdsm, homerobot, parashar2023slap]. The authors of [iioka2023mdsm] proposed MDSM, which is a two-stage segmentation model designed to refine masks generated by DDPM [ho2020ddpm]. Unlike MDSM, our method handles information about objects that exist outside the camera’s field of view.

Many datasets for referring expression understanding have been proposed RefCOCO[kazemzadeh2014referitgame], RefCOCO+[yu2016modeling], G-Ref[mao2016generation]. In the field of robotics, datasets that contain natural language instruction sentences [hatori2018interactively, qi2020reverie, shridhar2020alfred] are used for multimodal language understanding tasks. These datasets focus on object manipulation tasks within an indoor environment. [hatori2018interactively, qi2020reverie] are notable studies because they were based on real-world data. In particular, the instruction sentences included in the REVERIE dataset[qi2020reverie] often consist of multiple sentences, which make the task of identifying the target object particularly challenging.

III Problem Statement

In this study, we focus on the task that involves generating the segmentation mask of the target object from an image of the indoor environment, 3D point clouds, and an instruction related to object manipulation. We define this task as the Object Segmentation from Manipulation Instructions-3D (OSMI-3D) task. In this task, the model should generate a segmentation mask for the target object indicated in the instruction. Fig. 1 shows a typical input of the OSMI-3D task. The goal is to generate a mask, which is indicated by the red area, given an instruction such as “Go to the living room and bring me the pillow that is closest to the potted plant.”

We define the inputs and an output as follows:

•

Inputs: an image, 3D point cloud, and an instruction sentences.
•

Output: a pixel-wise segmentation mask of the target object indicated in the instrucion.

In this study, we do not assume cases in which there are multiple target objects or no target object in a single image. We use mean intersection over union (mIoU) and precision as the evaluation metrics.

IV Proposed Method

The proposed method predicts a segmentation mask for the target object referred to in the given object manipulation instructions. Our key contributions are as follows:

•

To train the model efficiently, we introduce the Optimal Transport Vertex Predictor (OTVP), with the PML, which uses optimal transport for vertex matching.
•

We introduce the OVA to obtain the open-vocabulary multimodal features of objects that exist outside the camera’s field of view. It can handle their correspondence with referring expressions.
•

We introduce the SBAE to enhance the understanding of attributes, such as shape and spatial relationships, based on the segmentation images.

Fig. 2 shows the overview of the proposed method. It consists of five main modules: LLM Paraphraser, SBAE, OVA, Visual Context Interpreter (VCI) and OTVP. Our method, particularly the proposed PML, can be widely applied to polygon-based mask generation models[liu2023polyformer, zhu2022seqtr].

The inputs are defined as $\bm{x}=\{\bm{x}_{\mathrm{img}},X_{\mathrm{pcl}},\bm{x}_{\mathrm{inst}}\}$ where $\bm{x}_{\mathrm{img}}\in\mathbb{R}^{3\times H\times W}$ , $X_{\mathrm{pcl}}=\{\bm{\xi}_{i}\mid i=0,1,2,\ldots,N_{\mathrm{pcl}}\}$ and $\bm{x}_{\mathrm{inst}}\in\{0,1\}^{v\times l}$ denote an image, 3D point clouds and an instruction sentences tokenized as a one-hot vector, respectively. Note that $H$ , $W$ , $\bm{\xi}_{i}$ , $N_{\mathrm{pcl}}$ , $v$ and $l$ denote the height of the image, width of the image, $i$ -th point, total number of points in a point cloud, vocabulary size and max token length of the instruction, respectively.

IV-A LLM Paraphraser

Unlike the RES task, the OSMI-3D task often involves two or more sentences. As shown later, typical RES models do not handle such cases and may focus on phrases unrelated to the target object. To improve the understanding of phrases associated with the target object, we introduce the LLM Paraphraser. Specifically, the LLM Paraphraser combines several sentences and summarizes referring expressions related to the target object into a single sentence.

LLM Paraphraser takes $\bm{x}_{\mathrm{inst}}$ as input. It summarizes $\bm{x}_{\mathrm{inst}}$ using an LLM (GPT-3.5-turbo[gpt35turbo]). For example, when $\bm{x}_{\mathrm{inst}}$ is “Go to the dining table. Then pick up the candle on the right, ” the sentence “Pick up the right candle on the dining table.” is obtained. We embed the sentence into the language features $\bm{h}_{\mathrm{llp}}\in\mathbb{R}^{d_{\mathrm{llp}}}$ using the text-embedding-ada-002[adatxtembeddingada002], where $d_{\mathrm{llp}}$ is the number of feature dimensions. The output of LLM Paraphraser is $\bm{h}_{\mathrm{llp}}$ .

IV-B Visual Context Interpreter

Previous RES studies can be mainly divided into two approaches for extracting image features: using image encoders (e.g., ResNet[He2015DeepRLresnet] and ViT[dosovitskiy2020image]) to extract visual features such as texture and edges; and using multimodal image encoders (e.g., CLIP[radford2021learning], UNITER [chen2020uniter], and BLIP[li2022blip]) to extract multimodal image features that are aligned with natural language. However, these features sometimes lack visual representations related to complex referring expressions (e.g., “the second chair from the left in the first floor dining room that has the mirror hanging above the fireplace”) and spatial relationships (e.g., “the hand towel on the towel rack to the left of the sink”).

To handle such complex visual representations, we introduce the VCI. In the VCI, multimodal LLMs generate descriptions that include details such as the attributes of objects, their spatial relationships, and their complex relationships in referring expressions. Furthermore, using multimodal LLMs, we can obtain additional common-sense knowledge that is not contained in the image alone. For example, if the scene shows an open door and the outside is visible through the door, it is highly likely to be an entrance.

The inputs of VCI are $\bm{x}_{\mathrm{img}}$ and $\bm{x}_{\mathrm{inst}}$ , and the output is the intermediate feature $\bm{h}_{\mathrm{vci}}\in\mathbb{R}^{d_{\mathrm{vci}}}$ , where $d_{\mathrm{\mathrm{vci}}}$ represents the dimension. First, we obtain a description of $\bm{x}_{\mathrm{img}}$ using a multimodal LLM (gpt-4-vision-preview [gpt4v]). We embed the description into $\bm{h}_{\mathrm{vci}}$ using the text-embedding-ada-002.

IV-C Open-Vocabulary 3D Aggregator

Existing methods [iioka2023mdsm, yang2022lavt, zhu2022seqtr] often fail to identify the target object given the referring expressions of objects outside the field of view. To address this issue, we introduce the OVA to enhance the understanding of referring expressions for objects outside the field of view. This module aligns 3D point clouds with open-vocabulary multimodal features and links them to referring expressions. As a result, it is expected to obtain information about objects outside the camera’s field of view without the need to capture images from various angles.

In this module, the input is $X_{\mathrm{pcl}}$ and the output is the intermediate feature $\bm{h}_{\mathrm{ova}}\in\mathbb{R}^{d_{\mathrm{ova}}}$ , where $d_{\mathrm{ova}}$ denotes the dimensionality. First, from $X_{\mathrm{pcl}}$ , we extract the $N_{\mathrm{near}}$ points closest to the position where $\bm{x}_{\mathrm{img}}$ was captured. This subset is denoted by $X_{\mathrm{near}}$ . We use only $N_{\mathrm{near}}$ points because referring expressions often refer to objects around the target object, and using points at remote locations is not efficient. We obtain multimodal features $\bm{h}^{\prime}_{\mathrm{ova}}=f\left(X_{\mathrm{near}}\right)$ . Note that, $f\left(\cdot\right)$ denotes the application of pre-trained OpenScene[Peng2023OpenScene]. OpenScene embeds multimodal features into each point of a 3D point cloud using CLIP[radford2021learning]. Finally, we obtain the feature $\bm{h}_{\mathrm{ova}}=\mathrm{MaxPool}\left(\mathrm{Upsample}\left(\bm{h}^{% \prime}_{\mathrm{ova}}\right)\right)$ where $\mathrm{MaxPool}\left(\cdot\right)$ and $\mathrm{Upsample}\left(\cdot\right)$ denotes upsampling process and max pooling, respectively.

IV-D Segment-Based Attentional Enhancer

Existing RES and OSMI models sometimes inappropriately predict the contours of objects. To address this, we introduce the SBAE to enhance the understanding of segment information related to target objects.

Fig. 3 shows the structure of the SBAE. It extracts image features at multiple resolutions from the $\bm{x}_{\mathrm{img}}$ and a segmentation image, and then integrates these features with $\bm{h}_{\mathrm{llp}}$ , $\bm{h}_{\mathrm{vci}}$ , and $\bm{h}_{\mathrm{ova}}$ from each module. The inputs to this module are $\bm{x}_{\mathrm{img}}$ , $\bm{h}_{\mathrm{llp}}$ , $\bm{h}_{\mathrm{vci}}$ and $\bm{h}_{\mathrm{ova}}$ . First, we use the pre-trained SAM[Kirillov_2023_ICCV] to obtain a segmentation image $\bm{s}\in\mathbb{R}^{3\times H\times W}$ given $\bm{x}_{\mathrm{img}}$ . As shown in the figure, we obtain image features $\{\bm{V}_{k}\in\mathbb{R}^{H_{k}\times W_{k}\times C_{k}}\}_{k=1}^{M_{v}}$ from intermediate layers at $M_{v}$ different resolutions using DarkNet-53[Wang_2021_CVPR] pre-trained on the MS-COCO[lin2014microsoft] dataset.

Similarly, we obtain image features $\left\{\bm{V}^{\prime}_{l}\in\mathbb{R}^{H_{l}\times W_{l}\times C_{l}}\right% \}^{M_{s}}_{l=1}$ from $M_{s}$ types of intermediate layers from $\bm{s}$ . Note that $H$ and $W$ denote the height and width of the image feature maps of $\bm{V}_{k}$ and $\bm{V}^{\prime}_{l}$ , respectively, and $C_{i}$ denotes the number of channels of $\bm{V}_{k}$ and $\bm{V}^{\prime}_{l}$ .

As shown in Fig. 3, we downsample each obtained $\bm{V}_{k}$ and $\bm{V}^{\prime}_{l}$ , and then obtain $\bm{V}_{\mathrm{mix}}$ by concatenating them in the channel dimension. Furthermore, we obtain $\bm{h}_{\mathrm{mix}}$ by concatenating $\bm{h}_{\mathrm{llp}}$ , $\bm{h}_{\mathrm{vci}}$ and $\bm{h}_{\mathrm{ova}}$ in the channel dimension, and $\bm{h}_{\mathrm{mix}}$ is downsampled. Finally, we compute the cross-attention between $\bm{V}_{\mathrm{mix}}$ and $\bm{h}_{\mathrm{mix}}$ to obtain the intermediate feature $\bm{S}_{a}=f_{a}(\bm{V}_{\mathrm{mix}},\bm{h}_{\mathrm{mix}})$ . Note that $f_{a}\left(\cdot,\cdot\right)$ denotes the cross-attention function. Furthermore, $f_{a}\left(\cdot,\cdot\right)$ for any matrices $\bm{X}_{A}$ and $\bm{X}_{B}$ is defined as follows:

\displaystyle f_{a}(\bm{X}_{A},\bm{X}_{B})

\displaystyle=\mathrm{softmax}\left(\frac{(\bm{W}_{q}\bm{X}_{A})(\bm{W}_{k}\bm% {X}_{B})^{\top}}{\sqrt{d}}\right)(\bm{W}_{v}\bm{X}_{B}),

where $\bm{W}_{q}$ , $\bm{W}_{k}$ , and $\bm{W}_{v}$ are trainable weights, and $d$ is a scaling factor.

IV-E Optimal Transport Vertex Predictor

The OTVP takes $\bm{S}_{a}$ as input. Output is $\hat{Y}=\{\hat{\bm{y}}_{i}\mid i=1,2,\ldots,N^{\prime}\}$ , where $\hat{\bm{y}}_{i}$ denotes the coordinate of a vertex. The OTVP consists of a transformer encoder and transformer decoder. The encoder and decoder consist of $n_{\mathrm{enc}}$ and $n_{\mathrm{dec}}$ transformer layers, respectively. By inputting $\bm{S}_{a}$ into this encoder-decoder and applying linear projection, we obtain $\hat{Y}$ .

Existing methods[zhu2022seqtr, liu2023polyformer, liu2021dance, peng2020deep] often fail to account for cases in which the order of vertices differs but represents the same polygon. The bottom-right diagram and bottom-left diagram in Fig. 1 show an example. The green and red shapes in the figure represent the predicted and correct masks, respectively. In the bottom-left diagram in Fig. 1, the two sets of vertices represent polygons of the same shape. However, the order of vertices is different. Despite appropriately predicting the set of vertices, most existing methods do not consider the polygons to be similar, which results in a significant loss. This can lead to inefficient training of the model.

We introduce the PML $\mathcal{L}_{OT}$ to effectively address such cases using optimal transport. As shown in the bottom-right diagram in Fig. 1, it involves matching between $\hat{Y}$ and the set of vertices of reference mask $Y=\{\bm{y}_{j}\mid j=1,2,\ldots,N\}$ using optimal transport, where $N$ represents the number of vertices in the reference mask’s set of vertices.

When $\hat{Y}$ and $Y$ are given, we can identify a transportation plan that transfers $\hat{Y}$ to $Y$ at the minimum transportation cost. We regard $\hat{Y}$ and $Y$ as two discrete distributions $\alpha=\sum_{i=1}^{N^{\prime}}\bm{a}_{i}\delta_{\hat{\bm{y}}_{i}}$ and $\beta=\sum_{j=1}^{N}\bm{b}_{j}\delta_{\bm{y}_{j}}$ , respectively. Note that $\delta_{\hat{\bm{y}}_{i}}$ and $\delta_{\bm{y}_{j}}$ represent the Dirac delta function centered on $\hat{\bm{y}}_{i}$ and $\bm{y}_{j}$ , respectively. The weight vectors $\bm{a}$ and $\bm{b}$ are normalized. Using this, we define $\mathcal{L}_{OT}$ as follows:

\displaystyle\mathcal{L}_{OT}=\min_{\bm{P}\in\mathcal{U}(\bm{a},\bm{b})}\sum_{% i=1}^{N^{\prime}}\sum_{j=1}^{N}C(\hat{\bm{y}}_{i},\bm{y}_{j})P_{ij},

(1)

\displaystyle\mathcal{U}(\bm{a},\bm{b})=\left\{\bm{P}\in\mathbb{R}^{{N^{\prime% }}\times{N}}\mid P_{ij}\geq 0,\bm{P}\mathbf{1}_{N^{\prime}}=\bm{a},\bm{P}^{% \top}\mathbf{1}_{N}=\bm{b}\right\},

where $\bm{1}_{N^{\prime}}$ and $\bm{1}_{N}$ denote vectors of dimension $N^{\prime}$ and $N$ , respectively, with all components equal to 1. Furthermore, $C(\bm{\hat{y}}_{i},\bm{y}_{j})=\|\bm{\hat{y}}_{i}-\bm{y}_{j}\|_{2}$ and ${P_{ij}}$ denote the transportation cost and transportation plan from $\bm{\hat{y}}_{i}$ to $\bm{y}_{j}$ , respectively, where $||\cdot||_{2}$ denotes the $L^{2}$ norm. In this study, we compute Equation (1) efficiently with entropy regularization and subsequently apply the Sinkhorn algorithm[cuturi2013sinkhorn].

TABLE I: Comparison with baseline methods on the SHIMRIE-3D dataset. The bold numbers represent the highest values for each metric.

Model	mIoU [%] $\uparrow$	[email protected] [%] $\uparrow$	[email protected] [%] $\uparrow$
LAVT[yang2022lavt]	28.16 $\pm$ 2.85	26.46 $\pm$ 4.01	18.75 $\pm$ 3.29
SeqTR [zhu2022seqtr]	21.84 $\pm$ 2.28	17.87 $\pm$ 7.00	5.16 $\pm$ 5.26
MDSM[iioka2023mdsm]	24.36 $\pm$ 3.87	22.49 $\pm$ 5.46	13.71 $\pm$ 3.34
Ours	38.16 $\pm$ 2.46	48.85 $\pm$ 2.70	22.29 $\pm$ 3.32

TABLE II: Quantitative results on ablation studies. The bold numbers represent the highest values for each metric. The columns labeled VCI, OVA, SBAE, and PML indicate whether each module is included, as indicated by a check mark.

Model	VCI	OVA	SBAE	PML	mIoU [%] $\uparrow$	[email protected] [%] $\uparrow$	[email protected] [%] $\uparrow$
(i)		✓	✓	✓	35.27 $\pm$ 5.41	45.31 $\pm$ 7.64	19.48 $\pm$ 4.99
(ii)	✓		✓	✓	37.36 $\pm$ 2.55	48.11 $\pm$ 4.13	27.24 $\pm$ 4.99
(iii)	✓	✓		✓	31.77 $\pm$ 0.92	37.86 $\pm$ 2.06	14.00 $\pm$ 4.28
(iv)	✓	✓	✓		33.07 $\pm$ 3.44	41.04 $\pm$ 6.74	20.42 $\pm$ 8.18
(v)	✓	✓	✓	✓	38.16 $\pm$ 2.46	48.85 $\pm$ 2.70	22.29 $\pm$ 3.32

V Experimental Results

V-A Quantitative Results

Table I shows the quantitative results of the comparison of the baseline methods and proposed method. We conducted the experiments five times each. The averages and standard deviations of mIoU and P@ $k$ ( $k$ =0.5, 0.7) are shown in the table. Furthermore, the bold numbers in Table I represent the highest values for each metric.

As the baseline methods, we used MDSM[iioka2023mdsm], LAVT[yang2022lavt], and SeqTR[zhu2022seqtr]. We chose them as baseline methods for the following reasons. We used MDSM because it has been successfully applied to the OSMI task. Furthermore, we used LAVT and SeqTR because these have been successfully applied to the RES task, which is closely related to the OSMI-3D task.

We used mIoU and Precision at $k$ (P@ $k$ ) as the evaluation metrics because they are standard metrics in RES tasks, closely associated with the OSMI-3D task. We chose mIoU as the primary metric. mIoU is defined as $\mathrm{mIoU}=({1}/N_{s})\sum_{i=1}^{N_{s}}|Y_{i}\cap\hat{Y}_{i}|/|Y_{i}\cup% \hat{Y}_{i}|$ , where $N_{s}$ , $\hat{Y}_{i}$ , and $Y_{i}$ denote the number of samples, and the sets of pixels corresponding to the predicted mask and GT mask, respectively, in the $i$ -th sample. P@ $k$ is defined as $\mathrm{P}@k={T_{k}}/N_{s}$ , where $T_{k}$ denotes the number of samples for which the IoU between a predicted mask and a GT mask exceeded the threshold $k$ .

Table I shows that the proposed method achieved an mIoU of 38.16%, whereas LAVT, SeqTR, and MDSM were 28.16%, 21.84%, and 24.36%, respectively. The proposed method outperformed the best result for the baseline methods, which was obtained by LAVT, by 10.00 points. Table I also shows that [email protected] for LAVT, SeqTR, MDSM, and the proposed method were 26.46%, 17.87%, 22.49%, and 48.85%, respectively. From the above, the proposed method outperformed the highest performing LAVT in terms of [email protected] by 22.39 points. Similarly, the proposed method also outperformed the baseline methods in termes of [email protected].

V-B Qualitative Results

Fig. 4 shows the qualitative results. In the figure, columns (a) and (b) represent $\bm{x}_{\mathrm{img}}$ and GT, respectively. Additionally, columns (c), (d), (e), and (f) represent the masks predicted by LAVT, SeqTR, MDSM and the proposed method, respectively. Fig. 4 (i)(ii) show successful examples.

Fig. 4 (i) shows an example in which the instruction was “In the 3rd level bathroom, there is a box of tissues to the left of the basin. Please fetch them here.” In this example, neither LAVT nor MDSM generated any masks, whereas SeqTR incorrectly masked the magazine. Conversely, the proposed method appropriately generated a mask for the tissue box, which demonstrates that it successfully identified the target object specified in the instruction. In the example from Fig. 4 (ii), the instruction was “Walk to the living room and fetch me the leftmost pillow on the smaller white sofa, the pillow closest to the plant on the small table.” In this case, LAVT and MDSM also did not generate masks for the object, and SeqTR generated a mask for a different object on the table. By contrast, the proposed method appropriately generated a mask for the beige cushion. We consider that the proposed method is capable of understanding referring expressions related to color and spatial relationships.

Fig. 4 (iii) illustrates a failure example. Fig. 4 (iii) shows an example with the instruction sentence “Go to the closet in the bedroom with the orange comforter and bring me the second hanger on top.” In this example, both LAVT and MDSM generated an under-segmented mask for two hangers, and SeqTR masked unrelated areas. Our method masked the hanger on the right-hand side, but the target object specified in the instruction sentence was the left hanger. In this example, the phrase “second hanger” in the instruction sentence was ambiguous, which made it difficult to select a single target object.

V-C Ablation Studies

We set the following four conditions for the ablation studies:

VCI ablation We removed the VCI and assessed the contributions. Table II shows that the mIoU in for Model (i) was 35.27%, which was 2.89 points lower than that for Model (v). [email protected] and [email protected] for Model (i) were also lower than that for Model (v). From the above, VCI contributed to the improvement of performance. This indicates that VCI enhanced the understanding of referring expressions.

OVA ablation We investigated the performance of OVA by removing it. Table II indicates that the mIoU for Model (ii) was 37.36%, which was 0.8 points lower than that for the Model (v). [email protected] also decreased. By contrast, [email protected] increased by 4.95 points. This may indicate that information about objects that exist outside the camera’s field of view was obtained by the OVA. However, using the OVA to obtain information around objects or the environment may inadvertently lead to focusing on nouns other than the target object. For example, Fig. 5 shows an example where the instruction sentence was “Go to the bathroom on level 1 and bring me the picture furthest to the left” for which the mask generated by Model (v) was slightly skewed toward the sink. This is likely to have occurred because the model excessively focused on the word ‘bathroom’ in the instruction, influenced by the feature of a sink related to ‘bathroom,’ which was obtained through the OVA. As a result, it is possible that the model could not focus strongly on the word ‘picture,’ which was the target object.

SBAE ablation To investigate the effectiveness of the SAM module in the SBAE, we removed it. Table II illustrates that Model (iii) achieved an mIoU of 31.77%, which was 6.39 points lower than that of Model (v). It also scored lower in terms of [email protected] and [email protected]. This suggests that the SBAE enhanced the understanding of segment information about objects, thereby enabling the more appropriate prediction of object contours.

PML ablation We investigated the implications for the performance of PML. Table II shows that the mIoU for Model (iv) was 33.07%. This result was 5.09 points lower than that for Model (v) and it was also lower in terms of [email protected] and [email protected]. This indicates that effective training was achieved using PML.

V-D Error Analysis

In this study, we defined failed cases as those with an IoU lower than 0.5. Based on this definition, the proposed method failed on 179 test samples. We analyzed 100 samples with the lowest IoU values out of 179 samples of failure cases. Table III describes the categories of the failure cases.

We roughly divided the cases into five types:

(a)

Serious comprehension error
This category includes failure cases in which our model incorrectly segmented a large part of objects that were not mentioned in the instruction. For example, our model incorrectly segmented ‘wall’ given the instruction “Clean the decoration on the table.” This is presumably because our model failed to align the image and language.
(b)

Reference/exophora resolution error
This category represents cases in which our model incorrectly segmented objects in the same category that were different from the target object. For instance, our model improperly segmented “the picture on the right-hand side” following the instruction “Bring the leftmost picture on the wall.” This is presumably because of our model’s failure to understand the referring expressions appropriately.
(c)

Segmentation of non-target objects
This category refers to cases in which our model segmented non-target objects in the instructions. For example, ‘bed’ was segmented given the instruction “Fetch me a pillow on the bed.”
(d)

Hallucination in VCI
The cases in this category involve multimodal LLM in VCI inappropriately describing the appearance and position of objects or non-existent objects. An example of this is, when there was no cushion in the room, the multimodal LLM generated the sentence “The cushion on the left is white.”
(e)

Ambiguous instruction
This category refers to cases in which the instructions included ambiguous expressions about the name or location of the target object, which made it difficult to identify the target object. Suppose that the instruction “Please bring the second hanger” was provided and the image contained multiple hangers. It would remain unclear which hanger was being referred to.

TABLE III: Error analysis on failure cases.

Errors

# of Errors

Serious comprehension error

Reference/exophora resolution error

Segmentation of non-target objects

Hallucination in VCI

Ambiguous instruction

Total

100

Table III indicates that the main bottlenecks were (a) and (b). We consider that the reason for the former was that the model failed to ground referring expressions with their corresponding target objects. As a solution, we may be able to use SEEM [zou2023segment]. SEEM performs open vocabulary panoptic segmentation, where textual features and visual prompt features are aligned in a joint visual-semantic space. To avoid focusing on irrelevant stuff or things, we can consider masking them out using this semantic labeling approach. Furthermore, for the latter, the model often misunderstood the spatial relationships between a target object and its surroundings. Additionally, there were some cases in which VCI failed to clearly describe the positional relationships between objects. Therefore, a solution may be to improve the prompt so that it focuses on spatial relationships.

VI Conclusions

In this study, we focused on the OSMI-3D task, where models generated segmentation masks of the target object given an image of the indoor environment, 3D point clouds, and an instruction sentence related to object manipulation. Our method outperformed the baseline methods on all standard metrics in the OSMI-3D task. For future research, we plan to implement a semantic labeling approach to mask out irrelevant stuff, thereby ensuring that the focus remains on pertinent things.

ACKNOWLEDGMENT

This work was partially supported by JSPS KAKENHI Grant Number 20H04269, JST Moonshot, JST CREST, and NEDO.

Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder–Decoder Network