Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation

Zihan Gao      Lingling Li      Licheng Jiao      Fang Liu      Xu Liu      Wen** Ma
Yuwei Guo      Shuyuan Yang
School of Artificial Intelligence, Xidian University
Corresponding author.
Abstract

Understanding 3D scenes is a crucial challenge in computer vision research with applications spanning multiple domains. Recent advancements in distilling 2D vision-language foundation models into neural fields, like NeRF and 3DGS, enable open-vocabulary segmentation of 3D scenes from 2D multi-view images without the need for precise 3D annotations. While effective, the per-pixel distillation of high-dimensional CLIP features introduces ambiguity and necessitates complex regularization strategies, adding inefficiencies during training. This paper presents MaskField, which enables fast and efficient 3D open-vocabulary segmentation with neural fields under weak supervision. Unlike previous methods, MaskField distills masks rather than dense high-dimensional CLIP features. MaskField employs neural fields as binary mask generators and supervises them with masks generated by SAM and classified by coarse CLIP features. MaskField overcomes ambiguous object boundaries by naturally introducing SAM-segmented object shapes without extra regularization during training. By circumventing the direct handling of high-dimensional CLIP features during training, MaskField is particularly compatible with explicit scene representations like 3DGS. Our extensive experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence, outperforming previous methods with just 5 minutes of training. We hope that MaskField will inspire further exploration into how neural fields can be trained to comprehend 3D scenes from 2D models.

1 Introduction

Recent developments in 3D scene representation, notably Neural Radiance Fields (NeRF) [1] and 3D Gaussian Splatting (3DGS) [2], have paved the way for learning-based methods to recover 3D geometry from merely posed 2D images. These advances have spurred research into leveraging 2D foundation models, contextualized by neural fields, to comprehend 3D scenes from multi-view 2D images [3, 4, 5, 6]. Utilizing pre-trained models like CLIP [7], these methods enable 3D open-vocabulary semantic segmentation without actual 3D data and annotation. This capability holds significant potential in diverse applications such as robot navigation [8], autonomous driving [9, 10], and urban planning [11, 12].

However, as CLIP is trained at the image level, distilling its features for per-pixel segmentation necessitates the extraction of pixel-aligned features. This requirement inevitably runs into the ambiguity inherent in CLIP features. To better understand this ambiguity, we visualize features extracted by various methods and present them in Fig. 1. These observations suggest that per-pixel distillation of CLIP features is inherently challenging and requires additional regularization strategies.

To address the ambiguity in CLIP features, previous methods for 3D open-vocabulary semantic segmentation with neural fields typically follow a complex framework. Initially, a multi-scale feature pyramid containing CLIP embeddings from image crops at different scales or regions is extracted for each view. Subsequently, a neural feature field is trained by distilling these multi-scale CLIP features. During this phase, off-the-shelf vision foundation models such as DINO [13] or SAM [14] are used to refine the neural feature field for object boundaries. The final segmentation map is then obtained by segmenting this field with text features from the CLIP text encoder. While effective, this pipeline is highly inefficient due to the need to manage dense, high-dimensional CLIP features and complex regularization during training. For example, in [4], training a small-scale scene with 28 views can require up to 64GB of RAM and 0.5 hours just to load preprocessed CLIP features, with an additional 1.3 hours of training on an RTX 3090 GPU. Such inefficiencies severely restrict these methods’ practical applicability in real-world settings. Given this, a natural question emerges: can we avoid the per-pixel distillation of CLIP features and still enable accurate segmentation?

Figure 1: (a) Extracting pixel-aligned CLIP features from image crops shows ambiguity around object boundaries. (b) SAM crops can provide object boundaries but show ambiguity due to a lack of contextual information and scale. (c) Dense features directly extracted from CLIP ViT’s feature map show inherent ambiguity in different regions. (d) Previous methods extract dense CLIP features and perform distillation at the pixel level. (e) Our proposed MaskField performs mask-level distillation to avoid handling the high-dimensional ambiguous CLIP features during training.

In this paper, we adopt a novel perspective to address these challenges, drawing inspiration from prior research [15, 16]. We introduce MaskField, a novel method that avoids the inefficiencies of per-pixel CLIP feature distillation. Instead of performing per-pixel distillation, we employ a neural field as a scene-level binary mask generator for mask-level distillation. Specifically, MaskField utilizes a neural field to represent mask features in 3D. These features are rendered using techniques tailored to neural fields, ensuring multi-view consistency for per-pixel mask features. Additionally, we introduce a set of scene-level learnable query and class token pairs. These tokens are designed to be invariant to 3D position and viewing direction, facilitating consistent predictions across different views. Each token pair represents a specific mask in the scene: the query token aids in binary mask prediction through a dot product with the rendered mask features, while the class token denotes the associated mask class. The predicted masks are supervised using a set of masks generated by SAM and classified based on coarse CLIP features extracted by [17, 18].

Distillation at the mask level brings two benefits. First, distilling binary masks eliminates the need for high-dimensional CLIP features during training. MaskField simplifies processing by handling shape and class separately through mask query tokens and class tokens, which enables faster training at a relatively low resolution (friendly to NeRF) and with a low mask feature dimension (friendly to 3DGS). Second, mask-level distillation naturally introduces object boundaries segmented by SAM without additional regularization during training. As a result, MaskField achieves superior results compared to previous state-of-the-art methods and can achieve comparable performance with only 5 minutes of training (a speed-up of approximately 19.5 times compared to [4]).

2 Related work

Neural Feature Fields. Researchers have developed learning-based methods to represent 3D scenes, primarily categorized into Neural Radiance Field (NeRF) [1, 19, 20] and 3D Gaussian Splatting (3DGS) [2, 21]. NeRF uses a multi-layer perceptron (MLP) to implicitly represent scenes by modeling 3D coordinates and viewing directions, producing corresponding RGB and volume density values. Conversely, 3DGS employs explicit 3D Gaussians for scene representation. Both methods can be adapted to train a feature field that generates high-dimensional feature vectors by distilling pre-trained vision models, enhancing 3D semantic understanding.

Distilled Feature Fields [22] initially explored distilling pixel-aligned feature vectors from LSeg [23], but this approach struggles to generalize to long-tail classes. LERF [3] proposed extracting a multi-scale feature pyramid from image crops to distill CLIP features for open-vocabulary segmentation. Extending this idea, 3D-OVS [4] introduced multi-scale and multi-spatial strategies to adapt CLIP’s image-level features for pixel-level segmentation, adding two regularization terms to mitigate ambiguities in CLIP features. Building on the success of 3DGS in novel view synthesis, Langsplat [5] and Feature GS [6] use 3DGS to create 3D representations. Langsplat also advocates the use of the hierarchy defined by the Segment Anything Model (SAM) to address CLIP feature ambiguities.

Although these methods show impressive results by performing pixel-level distillation of CLIP features into a neural feature field, constructing pixel-aligned CLIP features is inherently challenging due to their ambiguities. In this work, we perform mask-level distillation by leveraging neural fields as scene-level mask generators, which avoids addressing these ambiguities directly.

Open-vocabulary Segmentation. In recent years, open-vocabulary segmentation has seen considerable growth, fueled by the widespread availability of extensive text-image datasets and advanced pre-trained models like CLIP [7]. Researchers have approached the problem from different perspectives. A direct method [23] leverages the capabilities of CLIP to align pixel-level visual features with its text embeddings. More recent studies [24, 16, 25, 26, 27] have focused on using class-agnostic mask generators whose masks are classified via a parallel CLIP branch. Another emerging approach [28, 29] incorporates learnable tokens or adapter layers to predict masks directly using a frozen CLIP. However, because these methods are trained at the image level, they produce inconsistent segmentation across multiple views of a 3D scene.

Drawing on advancements in 2D open-vocabulary segmentation, MaskField introduces neural fields as a scene-level mask generator, performing mask-level distillation of CLIP features. Our approach leverages the inherent multi-view consistency of neural fields and transfers the open-vocabulary capabilities of CLIP into 3D environments.

Foundation Models. Pre-trained foundation models [13, 7, 30, 14] have become a cornerstone in the field of computer vision. These models are foundational for developing a comprehensive understanding of visual content, trained on vast datasets with a high number of parameters. For instance, CLIP [7, 30] combines an image encoder with a text encoder, utilizing an image-text contrastive learning strategy to form associations between images and their corresponding text descriptions from large-scale image-text data [31]. Demonstrating significant zero-shot capabilities [30], CLIP also integrates well with other modules, enhancing various downstream tasks [23, 32]. Beyond CLIP, the Segment Anything Model (SAM) [14] serves as a foundation model specifically for image segmentation. SAM is trained on an extensive dataset of 11 million images and 1 billion masks, enabling it to generate high-quality, class-agnostic region proposals that perform robustly across different applications [33, 34, 35, 36].

While CLIP and SAM show impressive capabilities in 2D image understanding, training such a model in 3D is challenging due to the expensive 3D annotations and unstructured scene representation. In this work, we distill the capabilities of SAM and CLIP into 3D to enable efficient 3D open-vocabulary semantic segmentation.

3 Method

Figure 2: An overview of the proposed MaskField. Given a set of multi-view images, our method distills the open-vocabulary knowledge from CLIP at a mask level. Our method does not need to handle high-dimensional CLIP features during training. Furthermore, our method naturally introduces region boundaries segmented by SAM without the need for complex regularization during training.

Given a set of multi-view 2D images and corresponding class descriptions, our objective is to segment the reconstructed neural field such that each 3D point is assigned a relevant class label. Previously, this problem has been addressed by associating each 3D point with a CLIP feature to represent its semantic meaning [3, 4]. As CLIP only generates image-level features, extracting pixel-aligned CLIP features presents challenges. Departing from these approaches, we introduce a method to avoid the inefficiencies associated with the inherent ambiguities of CLIP features by segmenting the neural field at a mask level. An overview of MaskField is presented in Fig. 2.

In this section, we first revisit the challenges of modeling neural fields for 3D open-vocabulary semantic segmentation and highlight the key factors contributing to inaccuracy and inefficiency. We then elaborate on the proposed MaskField, demonstrating how it effectively addresses these challenges.

3.1 Revisiting Neural Fields for 3D Open-vocabulary Semantic Segmentation

Given a set of $N$ calibrated multi-view images, a neural field is trained to represent the 3D scene. This exploration aims to understand the inherent inefficiencies in current learning paradigms. The specific type of neural field is not crucial for this discussion. For ease of reference, we denote the neural field as $\Phi(\mathbf{x},\mathbf{d})\rightarrow(\mathbf{c},\sigma)$, where $\mathbf{x}$ represents the 3D coordinates, $\mathbf{d}$ the viewing direction, $\mathbf{c}$ the color, and $\sigma$ the density. For a pixel $u$, the color can be rendered in 2D as follows:

$$\hat{\mathbf{C}}(u)=\sum_{i}T_{i}\alpha_{i}\mathbf{c}_{i},\quad T_{i}=\prod_{j=0}^{i-1}(1-\alpha_{j}),\quad \alpha_{i}=1-\exp(-\delta_{i}\sigma_{i}). \tag{1}$$

Here, $T_{i}$ represents the accumulated transmittance, $\alpha_{i}$ denotes the opacity of the $i$-th point, and $\delta_{i}$ is the distance between adjacent samples along the ray. Building on this, most existing methods [22, 3, 4, 5] construct a neural feature field, denoted as $\Psi_{f}(\mathbf{x},\mathbf{d})\rightarrow(\mathbf{c},\sigma,\mathbf{f})$, which not only models the scene’s geometry but also integrates a high-dimensional semantic feature vector $\mathbf{f}$. The feature map can be rendered in 2D similarly to the color:

$$\hat{\mathbf{F}}(u)=\sum_{i}T_{i}\alpha_{i}\mathbf{f}_{i}. \tag{2}$$

This enriched representation enables the distillation of features from the CLIP [7] image encoder based on the geometry provided by the neural field. However, the fundamental design of CLIP, which focuses on generating image-level features $F_{visual}\in\mathbb{R}^{D}$, introduces a challenge, because precise neural feature field distillation requires pixel-aligned features $\mathbf{F}\in\mathbb{R}^{D\times H\times W}$. In addition, the high-dimensional CLIP features introduce large memory usage and slow rasterization with explicit neural fields such as 3DGS.
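The renderings in Eqs. 1 and 2 share the same compositing weights $T_i\alpha_i$, so colors and feature vectors can be accumulated by one routine. The following is a minimal single-ray sketch in plain Python; the function name `composite_along_ray` and the list-based data layout are illustrative assumptions and are not taken from the paper's implementation.

```python
import math

def composite_along_ray(sigmas, deltas, values):
    """Alpha-composite per-sample vectors along one ray (Eqs. 1 and 2).

    sigmas: densities sigma_i of the samples, ordered front to back.
    deltas: spacings delta_i between adjacent samples.
    values: per-sample vectors (colors c_i or features f_i), all the same length.
    """
    dim = len(values[0])
    out = [0.0] * dim
    transmittance = 1.0  # T_0 = 1, the empty product
    for sigma, delta, vec in zip(sigmas, deltas, values):
        alpha = 1.0 - math.exp(-delta * sigma)
        weight = transmittance * alpha
        for d in range(dim):
            out[d] += weight * vec[d]
        transmittance *= 1.0 - alpha  # T_{i+1} = T_i * (1 - alpha_i)
    return out

# Two samples with alpha = 0.5 each give weights 0.5 and 0.25.
half = math.log(2.0)
print(composite_along_ray([half, half], [1.0, 1.0], [[1.0, 0.0], [1.0, 2.0]]))
```

Because the weights depend only on density, swapping the color list for a feature list is the only change needed to render a feature map instead of an image.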

As illustrated in Fig. 1, employing CLIP for pixel-aligned distillation inevitably confronts the issue of feature ambiguity. To bridge this gap between image and pixel-aligned requirements, mainstream methods segment the image into patches [3, 4] or smaller regions [5] provided by SAM [14]. This approach transforms the broad, image-wide CLIP features into a more localized form that can be directly applied at the pixel level. Also, recent methods introduce autoencoders [5] or upsamplers [6] to reduce the feature dimension.

Although pixel-level distillation of CLIP features into a neural field has shown effectiveness, there are two main factors contributing to its inaccuracy and inefficiency. First, the inherent ambiguity in CLIP features complicates achieving precise segmentation. This necessitates complex regularization strategies to clarify object boundaries, which in turn increases the computational load during training. Second, the per-pixel distillation of high-dimensional, pixel-aligned CLIP features often leads to inefficiency, placing significant demands on system resources. This limitation restricts scalability and practical application in larger scenes.
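To give a rough sense of scale for the second factor, the sketch below compares the memory needed to hold one dense per-view feature map at two feature dimensions. The 512-dimensional figure is an assumption (it matches the embedding size of CLIP ViT-B/16, which this section does not specify), float32 storage is assumed, and the $384\times 512$ resolution and 16-dimensional mask feature are the values quoted in the implementation details of Sec. 4. It is an illustration only, not a measurement of any method.

```python
def feature_map_bytes(dim, height, width, bytes_per_value=4):
    """Bytes needed to store one dense dim x height x width feature map."""
    return dim * height * width * bytes_per_value

MIB = 1024 ** 2
clip_bytes = feature_map_bytes(512, 384, 512)  # assumed CLIP embedding size
mask_bytes = feature_map_bytes(16, 384, 512)   # mask feature dimension D_m

print(f"512-d map: {clip_bytes / MIB:.0f} MiB per view")
print(f"16-d map:  {mask_bytes / MIB:.0f} MiB per view")
print(f"ratio:     {clip_bytes // mask_bytes}x")
```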

Due to the inaccuracy and inefficiency associated with the per-pixel distillation of CLIP features, we explore a mask-level strategy that avoids distilling pixel-aligned CLIP features and proves to be more effective and efficient.

3.2 MaskField Formulation

We introduce MaskField, a novel method for 3D open-vocabulary semantic segmentation using neural fields. Unlike previous approaches that perform per-pixel distillation of CLIP image features into a neural feature field, MaskField operates at the mask level. This method segments each object by generating and classifying comprehensive masks, thereby avoiding the complexities associated with the ambiguous high-dimensional CLIP features.

We formulate MaskField with a mask neural field represented as $\Psi_{m}(\mathbf{x},\mathbf{d})\rightarrow(\mathbf{c},\sigma,\mathbf{f}_{m})$, where $\mathbf{f}_{m}$ is a mask feature vector of dimension $D_{m}$. We employ $N_{K}$ pairs of query and class tokens, denoted as $\{(\mathbf{Q}_{i},\mathbf{CLS}_{i})\mid i=1,2,\ldots,N_{K}\}$, where $\mathbf{Q}_{i}\in\mathbb{R}^{D_{m}}$ and $\mathbf{CLS}_{i}\in\mathbb{R}^{N_{C}+1}$. $N_{C}$ represents the number of classes. Recognizing that the number of masks extracted by SAM may vary across different viewpoints, we include an additional "no object" label $\varnothing$ to manage views with fewer than $N_{K}$ masks, similar to strategies employed in prior studies [15].
From a specific viewpoint, the mask feature map $\hat{\mathbf{F}}_{m}\in\mathbb{R}^{D_{m}\times H\times W}$ is rendered. Using this map, the predicted binary masks are derived as $\{\mathbf{M}_{i}=\hat{\mathbf{F}}_{m}\cdot\mathbf{Q}_{i}\in\mathbb{R}^{H\times W}\mid i=1,2,\ldots,N_{K}\}$. Each class token $\mathbf{CLS}_{i}$ gives the class of the corresponding binary mask $\mathbf{M}_{i}$. It is important to highlight that neither the query tokens nor the class tokens are conditioned on the viewpoint. As the rendered mask feature map $\hat{\mathbf{F}}_{m}$ leverages the multi-view consistency inherent to neural fields, the query and class tokens serve as mask representations at the scene level. This configuration ensures that predictions maintain multi-view consistency, enabling consistent segmentation across different viewpoints.
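The mask prediction step is a per-pixel dot product between each query token and the rendered mask feature. The sketch below illustrates it in plain Python under stated assumptions: the nested-list layout, the function name `predict_masks`, and the sigmoid that maps the dot product to a probability are illustrative choices and are not specified in the text above.

```python
import math

def predict_masks(feature_map, queries):
    """Dot each query token with the rendered mask feature at every pixel.

    feature_map: nested list indexed [d][h][w], shape D_m x H x W.
    queries:     nested list indexed [k][d], shape N_K x D_m.
    Returns mask probabilities indexed [k][h][w].
    """
    dim = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    masks = []
    for query in queries:
        mask = [[0.0] * width for _ in range(height)]
        for h in range(height):
            for w in range(width):
                logit = sum(query[d] * feature_map[d][h][w] for d in range(dim))
                # Assumption: a sigmoid maps the dot product to [0, 1].
                mask[h][w] = 1.0 / (1.0 + math.exp(-logit))
        masks.append(mask)
    return masks

# Toy 1 x 2 view with D_m = 2: each query responds to one pixel.
features = [[[4.0, 0.0]], [[0.0, 4.0]]]
tokens = [[1.0, -1.0], [-1.0, 1.0]]
print(predict_masks(features, tokens))
```

Since `queries` does not depend on the camera, the same tokens are reused for every rendered view, which is what ties the per-view masks to scene-level objects.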

Training. To establish the supervision for MaskField, we extract class-agnostic masks, denoted as $\{\mathbf{M}^{\text{SAM}}_{j}\in\mathbb{R}^{H\times W}\mid j=1,2,\ldots,N\}$, by inputting a regular grid of $32\times 32$ point prompts into SAM. To classify these masks, we adopt the method from [17, 18] to extract visual CLIP features $\mathbf{F}_{visual}$. Although this extraction process involves only a single forward pass of the CLIP model, it typically yields very coarse features [37, 18], which previous methods [4, 3] have found challenging for precise localization. Thanks to MaskField’s mask-level distillation, our approach reduces reliance on high localization accuracy of features. We further compute the relevance map $\mathbf{R}$ by:

$$\mathbf{R}=\cos\langle\mathbf{F}_{text},\mathbf{F}_{visual}\rangle\in\mathbb{R}^{N_{C}\times H\times W} \tag{3}$$

where $\mathbf{F}_{text}$ denotes the CLIP text features of the $N_{C}$ class descriptions, and apply a softmax function to $\mathbf{R}$ to derive coarse class probabilities $\mathbf{p}\in\mathbb{R}^{N_{C}\times H\times W}$. The class of each mask is then determined by computing the mean class probabilities within the mask region and selecting the class with the highest mean probability:

$$S^{\text{SAM}}_{j}=\arg\max_{s}\left(\frac{1}{|\mathbf{M}^{\text{SAM}}_{j}|}\sum_{u\in\mathbf{M}^{\text{SAM}}_{j}}\mathbf{p}_{s}(u)\right). \tag{4}$$

Here, $u$ denotes a pixel that belongs to $\mathbf{M}^{\text{SAM}}_{j}$ and $\mathbf{p}_{s}(u)$ denotes the probability that $u$ belongs to semantic class $s$.
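The mask classification in Eqs. 3 and 4 can be sketched as follows. The code takes a precomputed relevance map (the cosine similarities of Eq. 3) and one binary SAM mask, applies a per-pixel softmax over classes, and averages inside the mask. The function names, the nested-list layout, and the choice to return the mean probability alongside the label are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_mask(relevance, mask):
    """Assign a class to one mask by mean class probability (Eq. 4).

    relevance: nested list indexed [c][h][w], shape N_C x H x W.
    mask:      nested list indexed [h][w] of booleans.
    Returns (class index, mean probability of that class inside the mask).
    """
    n_classes = len(relevance)
    pixels = [(h, w) for h, row in enumerate(mask)
              for w, inside in enumerate(row) if inside]
    if not pixels:
        raise ValueError("mask contains no pixels")
    mean_prob = [0.0] * n_classes
    for h, w in pixels:
        probs = softmax([relevance[c][h][w] for c in range(n_classes)])
        for c in range(n_classes):
            mean_prob[c] += probs[c] / len(pixels)
    best = max(range(n_classes), key=mean_prob.__getitem__)
    return best, mean_prob[best]

# Class 0 is more relevant on the left column, class 1 on the right.
rel = [[[0.9, 0.1], [0.9, 0.1]],
       [[0.1, 0.9], [0.1, 0.9]]]
left = [[True, False], [True, False]]
print(classify_mask(rel, left))
```

Averaging over the whole mask region is what makes coarse, poorly localized features usable: individual pixels may be misleading, but the label is decided by the region as a whole.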

To train the parameters of MaskField, we need to match the predicted masks $\mathbf{M}$ with $\mathbf{M}^{\text{SAM}}$. Similar to previous works in mask-based segmentation [15, 38, 24], MaskField adopts a bipartite matching-based assignment strategy to align each predicted mask with a mask extracted by SAM. Given a matching $\{(\mathbf{M}_{i},\mathbf{CLS}_{i}),(\mathbf{M}_{j}^{\text{SAM}},S^{\text{SAM}}_{j})\}$, the distillation loss is computed as a linear combination of a focal loss [40] and a dice loss [39], which optimize the binary mask, and a cross-entropy loss for mask classification:

$$\mathcal{L}_{distill}=\mathcal{L}_{ce}\left(\mathbf{CLS}_{i},S^{\text{SAM}}_{j}\right)+\mathbbm{1}_{S_{j}^{\text{SAM}}\neq\varnothing}\left[\lambda_{focal}\mathcal{L}_{focal}\left(\mathbf{M}_{i},\mathbf{M}_{j}^{\text{SAM}}\right)+\lambda_{dice}\mathcal{L}_{dice}\left(\mathbf{M}_{i},\mathbf{M}_{j}^{\text{SAM}}\right)\right]. \tag{5}$$
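A minimal sketch of Eq. 5 for one matched pair, together with a toy assignment step, is given below. Several details are assumptions made for illustration: masks are flattened lists of probabilities, the focal parameters (`gamma=2.0`, `alpha=0.25`) and the dice smoothing term are common defaults rather than values from this paper, the loss weights default to a placeholder of 1.0, and the brute-force `match` function stands in for a proper Hungarian solver and only suits tiny inputs.

```python
import math
from itertools import permutations

def dice_loss(pred, target, eps=1.0):
    """Soft dice loss between mask probabilities and a binary target."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def focal_loss(pred, target, gamma=2.0, alpha=0.25):
    """Mean binary focal loss; gamma and alpha are assumed defaults."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        p_t = p if t else 1.0 - p
        a_t = alpha if t else 1.0 - alpha
        total -= a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(pred)

def cross_entropy(class_logits, label):
    """Cross-entropy of softmax(class_logits) against an integer label."""
    top = max(class_logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in class_logits))
    return log_z - class_logits[label]

def distill_loss(pred_mask, class_logits, sam_mask, sam_label, no_object,
                 lam_focal=1.0, lam_dice=1.0):
    """Eq. 5 for one matched pair; mask terms are skipped for 'no object'."""
    loss = cross_entropy(class_logits, sam_label)
    if sam_label != no_object:
        loss += lam_focal * focal_loss(pred_mask, sam_mask)
        loss += lam_dice * dice_loss(pred_mask, sam_mask)
    return loss

def match(cost):
    """Brute-force minimum-cost assignment (toy stand-in for Hungarian matching).

    cost[i][j] is the cost of pairing prediction i with target j.
    Returns a tuple whose j-th entry is the prediction matched to target j.
    """
    n_pred, n_tgt = len(cost), len(cost[0])
    return min(permutations(range(n_pred), n_tgt),
               key=lambda perm: sum(cost[i][j] for j, i in enumerate(perm)))

print(match([[5.0, 0.0], [0.0, 5.0]]))
```

In this sketch the pairwise cost matrix is supplied by the caller; one natural choice would be to fill it with `distill_loss` evaluated for every prediction and target pair.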

Inference. To compose the final segmentation map, we aggregate the predicted masks using a simple matrix multiplication. Similar to previous works [15], we perform a marginalization over the predicted mask probabilities. Specifically, we compute the class of a pixel by:

$$S(u)=\arg\max_{s\neq\varnothing}\sum_{i=1}^{N_{K}}\mathbf{CLS}_{i}(s)\cdot\mathbf{M}_{i}(u). \tag{6}$$

In this way, masks segmented by SAM are effectively combined based on their classification and mask confidence, resulting in a more cohesive and accurate final prediction that robustly handles varying scales.
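The marginalization in Eq. 6 can be sketched as below. The code assumes that each $\mathbf{CLS}_i$ has already been normalized to class probabilities with the "no object" entry stored last, and that each $\mathbf{M}_i$ holds per-pixel mask probabilities; both conventions, as well as the function name `segment`, are illustrative assumptions.

```python
def segment(class_probs, mask_probs):
    """Per-pixel argmax over classes of sum_i CLS_i(s) * M_i(u) (Eq. 6).

    class_probs: nested list indexed [i][s]; the last column is 'no object'.
    mask_probs:  nested list indexed [i][h][w] for the N_K predicted masks.
    Returns an H x W nested list of class indices.
    """
    n_classes = len(class_probs[0]) - 1  # exclude the 'no object' column
    height = len(mask_probs[0])
    width = len(mask_probs[0][0])
    labels = [[0] * width for _ in range(height)]
    for h in range(height):
        for w in range(width):
            scores = [
                sum(cls[s] * mask[h][w]
                    for cls, mask in zip(class_probs, mask_probs))
                for s in range(n_classes)
            ]
            labels[h][w] = max(range(n_classes), key=scores.__getitem__)
    return labels

# Three token pairs, two real classes, a 1 x 2 view.
cls = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.05, 0.05, 0.9]]
masks = [[[0.9, 0.1]], [[0.2, 0.8]], [[0.5, 0.5]]]
print(segment(cls, masks))
```

A token pair that is mostly "no object" (the third one here) contributes little to any real class, so it has almost no influence on the final map even when its mask covers the pixel.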

MaskField simplifies the segmentation process by eliminating the need to manage pixel-aligned, high-dimensional CLIP features. Instead, it averages the CLIP semantics on a per-mask basis and conducts distillation at the mask level. Notably, MaskField does not require running any foundation model during training and allows for the selection of a low-dimensional feature vector $\mathbf{f}_{m}$. These modifications significantly improve efficiency, simplify the segmentation workflow, and reduce computational overhead.

4 Experiments

Table 1: Quantitative comparisons on mIoU. We report the mIoU ($\uparrow$) scores of the following methods and highlight the best, second-best, and third-best scores.
Type  Method            bed   sofa  lawn  room  bench  table  blue sofa  covered desk  snacks  office desk
2D    LSeg [23]         56.0  04.5  17.5  19.2  06.0   07.6   17.6       21.8          32.1    19.6
2D    ODISE [27]        52.6  48.3  39.8  52.5  24.1   39.7   55.4       67.3          33.7    55.5
NeRF  FFD [22]          56.6  03.7  42.9  25.1  06.1   07.9   16.3       26.8          34.9    19.0
NeRF  Sem(ODISE) [41]   50.3  27.7  24.2  29.5  25.6   18.4   52.8       45.8          23.4    65.4
NeRF  LERF [3]          73.5  27.0  73.7  46.6  53.2   33.4   32.2       51.8          48.7    43.9
NeRF  3D-OVS [4]        89.5  74.0  88.2  92.8  89.3   88.8   82.8       88.6          95.8    91.7
3DGS  Langsplat [5]     73.5  82.3  89.9  95.0  70.6   77.8   94.5       80.2          92.0    93.6
3DGS  Feature GS [6]    56.6  06.7  37.3  20.5  06.2   08.3   18.0       23.8          32.5    20.0
      Ours (NeRF)       96.7  92.3  91.3  92.9  94.3   92.6   96.1       91.5          95.8    85.6
      Ours (3DGS)       97.6  92.1  90.0  96.0  93.5   93.2   95.1       91.5          97.1

Table 2: Quantitative comparisons on Acc. We report the Acc ($\uparrow$) scores of the following methods and highlight the best, second-best, and third-best scores.
Type  Method            bed   sofa  lawn  room  bench  table  blue sofa  covered desk  snacks  office desk
2D    LSeg [23]         87.6  16.5  77.5  46.1  42.7   29.9   68.9       79.4          84.5    41.5
2D    ODISE [27]        86.5  35.4  82.5  59.7  39.0   34.5   56.1       75.0          63.4    61.0
NeRF  FFD [22]          86.9  09.5  82.6  51.4  42.8   30.1   77.6       82.6          84.1    41.6
NeRF  Sem(ODISE) [41]   86.5  22.2  80.5  61.5  56.4   30.8   58.1       61.4          57.0    88.3
NeRF  LERF [3]          86.9  43.8  93.5  9.8   79.7   41.0   85.0       91.4          91.5    88.2
NeRF  3D-OVS [4]        96.7  91.6  97.3  98.9  96.3   96.5   97.7       97.2          99.1    96.2
3DGS  Langsplat [5]     89.7  98.7  95.6  99.4  92.6   92.3   99.4       92.1          96.3    99.1
3DGS  Feature GS [6]    87.5  12.4  82.6  36.7  43.0   32.2   77.3       80.6          84.7    41.8
      Ours (NeRF)       99.2  97.9  98.9  97.9  98.7   98.3   99.5       97.4          99.2    91.7
      Ours (3DGS)       99.4  97.6  98.7  99.4  98.6   98.3   99.4       97.4          99.4

Implementation Details. We implement MaskField using PyTorch, and all experiments are conducted on a GeForce RTX 3090 GPU with 32GB RAM. We adopt TensoRF [42] as the backbone and extract CLIP features using MaskCLIP [17] and FeatUp [18]. The hyperparameters for training TensoRF strictly follow the training settings of [4] for a fair comparison. Additionally, we implement MaskField with 3DGS [2] following the settings of Feature GS [6]. Unless otherwise specified, we use 180 pairs of mask query tokens and class tokens in total, and the mask feature dimension is set to 16 for both NeRF and 3DGS. During mask classification, masks with a confidence level below 0.4 are filtered out to ensure robust training. We also employ a resolution warm-up strategy from $48\times 64$ to $384\times 512$ during training. More implementation details are given in Appendix A.

Datasets and Metrics. To evaluate the effectiveness of our approach, we follow the protocol established in previous work [4] and use the 3D-OVS [4] dataset, which is specifically designed for 3D open-vocabulary semantic segmentation. The 3D-OVS dataset includes a diverse collection of long-tail objects captured in various poses and backgrounds, making it ideal for assessing the performance of our method in complex and varied scenarios. Unlike previous works [4, 5] that use a subset of the 3D-OVS dataset, we report our performance on all 10 scenes for a more comprehensive evaluation. Following previous works [4, 5], we report mean Intersection-over-Union (mIoU) and mean pixel accuracy (Acc). More details on the dataset and the reasons for its selection are given in Sec. A.1.
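For reference, the two metrics can be computed on a single test view as in the sketch below. The function name is hypothetical, and two conventions are assumptions rather than details from the evaluation protocol: classes absent from both the prediction and the ground truth are skipped when averaging IoU, and accuracy is taken as the fraction of correctly labeled pixels.

```python
import numpy as np

def miou_and_accuracy(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape. Returns (mIoU, pixel accuracy)."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumed convention)
        ious.append(np.logical_and(p, g).sum() / union)
    miou = float(np.mean(ious))
    acc = float((pred == gt).mean())
    return miou, acc

# Toy 2x3 label maps with three classes and one mislabeled pixel.
gt = np.array([[0, 0, 1],
               [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 2]])
miou, acc = miou_and_accuracy(pred, gt, num_classes=3)
print(round(miou, 4), round(acc, 4))
```

A single wrong pixel lowers the IoU of both the class it was taken from and the class it was given to, which is why mIoU is usually the stricter of the two scores.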

Baseline Methods. We select four NeRF-based methods capable of 3D open-vocabulary segmentation for our analysis: FFD [22], Semantic-NeRF [41], LERF [3], and 3D-OVS [4]. Additionally, we include results from independently segmenting each test view using 2D open-vocabulary segmentation methods, namely LSeg [23] and ODISE [27]. Furthermore, we implement MaskField with 3DGS and compare it with other 3DGS-based methods, namely Feature 3DGS [6] and Langsplat [5]. Among these approaches, 3D-OVS serves as a direct baseline for MaskField because the two share several properties: both are weakly supervised, require a full text list of the categories in the scene, focus on semantic segmentation of the entire scene, and distill knowledge directly from a frozen CLIP model. This contrasts with methods like Feature GS, FFD, and Semantic-NeRF, which may involve fine-tuning CLIP or adapting 2D segmentation techniques. These similarities make 3D-OVS an appropriate benchmark for assessing the effectiveness of MaskField in handling 3D open-vocabulary segmentation under weak supervision.

Figure 3: Qualitative comparisons of 3 different scenes. Our method successfully recognizes long-tailed objects and gives the most accurate segmentation maps.

4.1 Comparative Results

Table 1 and Table 2 detail the performance of MaskField alongside the baseline methods. The qualitative results are given in Fig. 3. Additional qualitative results, such as the distilled masks and a PCA visualization of the mask feature map, can be found in Appendix B.

MaskField not only surpasses traditional 2D segmentation approaches but also significantly outperforms previous neural field-based methods such as LERF and 3D-OVS. Notably, methods like LSeg, ODISE, FFD and Feature GS employ features from a fine-tuned CLIP model to enhance boundary precision. However, fine-tuning can damage the extensive open-vocabulary knowledge inherent in CLIP, posing challenges in recognizing long-tailed classes. Furthermore, being trained solely on 2D images, ODISE tends to produce results that are inconsistent across multiple views, thus compromising performance. While Sem(ODISE) addresses these multi-view inconsistencies, it fails to tackle variations in object scale. In other words, Sem(ODISE) only accepts one full segmentation map as supervision, whereas our method can accept and aggregate multiple masks of one object at different scales, making it more robust. Although 3D-OVS and LERF have implemented strategies to mitigate the ambiguity in CLIP features, their outcomes are generally less impressive, often marred by patchiness and overly smoothed features. Langsplat uses SAM in feature extraction, but it requires training separate models at different scales, which hinders efficiency. Nevertheless, LERF and Langsplat offer the advantage of training without a complete list of categories, which our method does not currently support. Given that this paper primarily focuses on proposing an alternative to the per-pixel distillation of CLIP features into neural feature fields, we consider integrating this feature as a direction for future work.

Cooperation with 3DGS. MaskField introduces a novel learning paradigm for 3D segmentation using neural fields. We confirm its effectiveness in conjunction with the recently proposed 3D Gaussian Splatting (3DGS) [2]. Regardless of the type of neural field employed, MaskField consistently outperforms prior methods. Notably, previous 3DGS-based methods, like Langsplat and Feature GS, often reduce the dimensions of CLIP features. This is necessary because high-dimensional vectors linked to each Gaussian cause large memory usage and slow rasterization. MaskField, however, operates efficiently with low-dimensional mask features. It therefore needs no additional dimension reduction strategy, making MaskField a natural choice for 3D segmentation with 3DGS.
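A back-of-the-envelope calculation illustrates the memory argument. The sketch below counts only the stored per-Gaussian feature values, ignoring gradients and optimizer state. The count of one million Gaussians and the 4-byte float are illustrative assumptions, not figures from the experiments.

```python
def feature_memory_mib(num_gaussians, feature_dim, bytes_per_value=4):
    """Memory (MiB) needed to store one feature vector per Gaussian."""
    return num_gaussians * feature_dim * bytes_per_value / 2**20

# Hypothetical scene with one million Gaussians.
for dim in (512, 128, 16):
    print(dim, feature_memory_mib(1_000_000, dim))
```

Under these assumptions a raw 512-dimensional CLIP feature per Gaussian needs about 1.9 GiB, while a 16-dimensional mask feature needs about 61 MiB, a 32-fold reduction.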

Table 3: Training efficiency analysis on the sofa scene.
Method            Feature dimension  mIoU  Accuracy  Training time  Inference
LERF [3]          512                27.0  43.8      19.4 min       121.4 s
3D-OVS [4]        512                74.0  92.3      78 min         6.6 s
Langsplat [5]     3                  82.3  97.9      66 min         401.9 s
Feature GS [6]    128                06.7  12.4      87 min         6.0 s
MaskField (NeRF)  16                 92.3  97.9      5 min          7.1 s
MaskField (3DGS)  8                  91.6  97.5      19 min         4.9 s
MaskField (3DGS)  16                 92.1  97.6      22 min         5.5 s
MaskField (3DGS)  32                 92.1  97.6      37 min         5.6 s
MaskField (3DGS)  128                92.2  97.7      84 min

4.2 Training Time

We compare the training time of MaskField with that of previous state-of-the-art methods; the results are shown in Table 3. All models are trained for 15k iterations. For the NeRF-based methods, we use a batch size of 4096 rays. Since our method requires rendering the whole image during training, we recalculate the total number of iterations to match the 4096-ray batch size for a fair comparison. MaskField trains substantially faster than prior methods. Unlike LERF, which faces challenges with multi-scale supervision of the CLIP feature pyramid, and 3D-OVS, which is burdened by complex regularization requirements during training, MaskField streamlines the process. Furthermore, while Langsplat requires training separate models at different scales and Feature GS struggles with slow rasterization due to the high feature dimensions of its Gaussians, MaskField efficiently circumvents these issues. Further investigation into the impact of feature dimension reveals only marginal improvements as the dimension increases. This suggests that MaskField is particularly well-suited for 3DGS, where high-dimensional vectors associated with each Gaussian typically lead to extensive memory usage and slow rasterization. The efficiency of MaskField can be attributed to two main factors. First, distilling binary masks avoids handling high-dimensional CLIP features during training. Second, mask distillation naturally introduces object boundaries extracted by SAM, which addresses the ambiguity in CLIP features without complex regularization during training.
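One plausible reading of the iteration matching is to equalize the total number of rendered rays. The sketch below follows that reading; the exact conversion used in the experiments is not stated, so the function and its rounding are assumptions. The iteration count, batch size and resolutions are taken from the text.

```python
def equivalent_image_iterations(ray_iterations, rays_per_batch, height, width):
    """Number of full-image iterations that render the same total number of rays."""
    total_rays = ray_iterations * rays_per_batch
    return total_rays // (height * width)

# 15k iterations of 4096 rays, expressed as full-image renders.
print(equivalent_image_iterations(15_000, 4096, 384, 512))  # at the final resolution
print(equivalent_image_iterations(15_000, 4096, 48, 64))    # at the warm-up resolution
```

At 384 × 512 one image contains 196,608 rays, i.e. 48 batches of 4096 rays, so the same ray budget covers far fewer whole-image iterations than ray-batch iterations.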

4.3 Ablation Study

A series of ablation studies is performed to confirm the efficiency and effectiveness of the proposed MaskField.

Table 4: Per-pixel vs. Mask Distillation.
mIoU Accuracy
Per-pixel (coarse CLIP) 59.0 78.8
Per-pixel (SAM s) 71.3 83.0
Per-pixel (SAM m) 67.7 81.9
Per-pixel (SAM l) 75.5 82.1
MaskField 92.3

Per-pixel vs. Mask Distillation. In Table 4, we verify the improvements achieved through mask distillation. First, we examine the performance gained by employing SAM-generated masks. Specifically, we perform per-pixel distillation using coarse CLIP features and report its performance. Subsequently, we use masks generated by SAM to compute CLIP embeddings, akin to the approach used in Langsplat [5]. Each pixel is assigned the CLIP feature of its associated SAM mask. Given that SAM extracts masks at three different scales, we train three models at these scales and report their performance. Direct distillation of per-pixel CLIP features without regularization tends to lead to ambiguity around object boundaries and parts with similar appearances. Using CLIP features with regions segmented by SAM alleviates boundary issues but still suffers from ambiguity related to scale. MaskField naturally incorporates the boundaries delineated by SAM and also integrates different object scales. This is because MaskField allows one pixel to belong to several masks that describe different scales of the object, whereas previous methods with pixel-level distillation only accept one full feature map as supervision. This distinction allows MaskField to manage scale and detail more effectively in the segmentation process.
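The difference between the two kinds of supervision can be made concrete with a toy example. The sketch below builds a per-pixel feature target from overlapping masks, which forces a choice of one feature per pixel. The function name and the tie-break (smaller masks overwrite larger ones) are assumptions for illustration, not the baseline's documented behavior.

```python
import numpy as np

def per_pixel_target(masks, mask_feats):
    """Build a single (H, W, D) feature map from (M, H, W) boolean masks.

    Each pixel can keep only one feature, so overlapping masks must be resolved.
    Assumed tie-break: larger masks are painted first, smaller ones overwrite them.
    """
    num_masks, height, width = masks.shape
    target = np.zeros((height, width, mask_feats.shape[1]), dtype=mask_feats.dtype)
    areas = masks.reshape(num_masks, -1).sum(axis=1)
    for i in np.argsort(-areas):
        target[masks[i]] = mask_feats[i]
    return target

# Mask 0 covers the whole 2x3 image (e.g. an object), mask 1 covers its right column (a part).
masks = np.zeros((2, 2, 3), dtype=bool)
masks[0] = True
masks[1, :, 2] = True
feats = np.array([[1.0, 0.0], [0.0, 1.0]])

target = per_pixel_target(masks, feats)
membership = masks.sum(axis=0)  # how many masks each pixel belongs to
print(target[0, 2], membership[0, 2])
```

The top-right pixel belongs to both masks, but the per-pixel target keeps only the feature of the smaller one. Supervising with the mask stack itself preserves both memberships, which is the property attributed to mask distillation above.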

Table 5: Number of Queries.
mIoU Accuracy
50 43.6 20.0
100 90.8 97.5
140 90.3 97.2
180 92.3 97.6
220 91.5

Number of Queries. Table 5 shows MaskField trained with different numbers of queries. MaskField with 180 queries performs best. We observe a significant drop in performance when there are too few mask queries to represent the scene. However, the performance differences are minimal once the number of queries exceeds 100. This suggests that the exact number of queries does not need to be meticulously chosen, indicating that MaskField is robust to this parameter.

5 Discussion and Conclusion

In this paper, we introduce MaskField, a novel learning paradigm for 3D segmentation using neural fields. Our primary objective is to demonstrate that mask distillation can be a promising alternative to per-pixel distillation in 3D segmentation. Mask distillation aligns naturally with neural fields as it avoids the complexities of managing ambiguous, high-dimensional CLIP features during training, thereby enhancing efficiency. Notably, for technologies like 3D Gaussian Splatting (3DGS), MaskField does not require strategies to reduce CLIP feature dimensions.

Extensive experiments confirm that MaskField not only outperforms previous state-of-the-art methods but also achieves exceptionally rapid convergence, surpassing prior techniques with just 5 minutes of training. However, there are limitations that we leave for future work. First, MaskField currently relies on a predefined comprehensive object list for the scene. This requirement could potentially be circumvented by matching masks in the feature space of advanced foundation models such as DINO [13], rather than classifying masks prior to training. Second, incorrect classification of masks could introduce noise into the training process. While we anticipate that this noise might be mitigated through the averaging of multi-view information in neural fields, developing more robust strategies to directly tackle this issue could be beneficial.

We hope that MaskField can offer valuable insights into the advancement of 3D scene segmentation using neural fields, encouraging further research and development in this promising area.

References

  • Mildenhall et al. [2020] B Mildenhall, PP Srinivasan, M Tancik, JT Barron, R Ramamoorthi, and R Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, 2020.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. 2023.
  • Kerr et al. [2023] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In International Conference on Computer Vision (ICCV), 2023.
  • Liu et al. [2023] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open-vocabulary segmentation. Advances in Neural Information Processing Systems, 36:53433–53456, 2023.
  • Qin et al. [2023] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. arXiv preprint arXiv:2312.16084, 2023.
  • Zhou et al. [2023a] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. arXiv preprint arXiv:2312.03203, 2023a.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Shafiullah et al. [2022] Nur Muhammad Mahi Shafiullah, Chris Paxton, Lerrel Pinto, Soumith Chintala, and Arthur Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory. arXiv preprint arXiv:2210.05663, 2022.
  • Yang et al. [2023a] Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023a.
  • Yang et al. [2023b] Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al. EmerNeRF: Emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077, 2023b.
  • Hu et al. [2021] Qingyong Hu, Bo Yang, Sheikh Khalid, Wen Xiao, Niki Trigoni, and Andrew Markham. Towards semantic segmentation of urban-scale 3d point clouds: A dataset, benchmarks and challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Zhang and Zhang [2022] Lefei Zhang and Liangpei Zhang. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities. IEEE Geoscience and Remote Sensing Magazine, 10(2):270–294, 2022.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Cheng et al. [2021] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in neural information processing systems, 34:17864–17875, 2021.
  • Ding et al. [2022] Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022.
  • Zhou et al. [2022] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pages 696–712. Springer, 2022.
  • Fu et al. [2024] Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GkJiNn2QDF.
  • Barron et al. [2021] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  • Yu et al. [2023a] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. arXiv preprint arXiv:2311.16493, 2023a.
  • Kobayashi et al. [2022] Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 35:23311–23330, 2022.
  • Li et al. [2021] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2021.
  • Ghiasi et al. [2022] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.
  • Xu et al. [2022] Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, and Xiang Bai. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model. In European Conference on Computer Vision, pages 736–753. Springer, 2022.
  • Liang et al. [2023] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  • Xu et al. [2023a] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023a.
  • Zhou et al. [2023b] Ziqin Zhou, Yinjie Lei, Bowen Zhang, Lingqiao Liu, and Yifan Liu. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11175–11185, 2023b.
  • Xu et al. [2023b] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023b.
  • Cherti et al. [2023] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=M3Y74vmsMcY.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
  • Yu et al. [2023b] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023b.
  • Zhang et al. [2023] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. In The Twelfth International Conference on Learning Representations, 2023.
  • Ke et al. [2024] Lei Ke, Mingqiao Ye, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu, et al. Segment anything in high quality. Advances in Neural Information Processing Systems, 36, 2024.
  • Ma et al. [2024] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1):654, 2024.
  • Wu et al. [2023] Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403, 2023.
  • Wang et al. [2021] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5463–5474, 2021.
  • Milletari et al. [2016] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • Zhi et al. [2021] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J Davison. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021.
  • Chen et al. [2022] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), 2022.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127.

Appendix

In this supplement, we provide more implementation details on datasets and baseline methods. We also include more qualitative results. Finally, we discuss the limitations and future works.

Appendix A More Implementation Details

A.1 Datasets

The 3D-OVS [4] dataset is specifically designed for 3D open-vocabulary semantic segmentation. It includes a diverse collection of long-tail objects captured in various poses and backgrounds, making it ideal for assessing the performance of our method in complex and varied scenarios. The 3D-OVS dataset contains full segmentation annotations, whereas some other recently proposed datasets, such as LERF [3], annotate only some foreground objects. Since MaskField performs semantic segmentation of the full image, the LERF dataset is unsuited to such an evaluation. Unlike previous works [4, 5] that use a subset of the 3D-OVS dataset, we report our performance on all 10 scenes for a more comprehensive evaluation.

A.2 Baseline Methods

We implement the baseline methods following their open-source codebases. All experiments are conducted with 15k training iterations. For NeRF-based methods, a batch size of 4096 rays is adopted. For 3DGS-based methods, we keep the resolution fixed at 384 × 512. In particular, LERF [3] and FFD [22] are implemented based on Instant-NGP [43], and the others [41, 4] are based on TensoRF [42].

A.3 Implementation of MaskField

We developed two versions of MaskField: MaskField (NeRF) and MaskField (3DGS). For MaskField (NeRF), our implementation builds upon the 3D-OVS [4] codebase, adhering strictly to the original settings and hyperparameters to ensure a fair comparison. Likewise, for MaskField (3DGS), we based our implementation on Feature GS [6], carefully aligning the hyperparameters to facilitate consistent and comparable results across different models. This allows us to accurately evaluate the performance improvements and efficiencies that MaskField introduces within various 3D segmentation frameworks. For bipartite matching of masks, we use the implementation available in scipy (https://pypi.org/project/scipy/). During mask classification, masks with a confidence level below 0.4 are filtered out to ensure robust training. Only the remaining masks are used in subsequent training.
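The bipartite matching step can be sketched with scipy's `linear_sum_assignment`, which solves the assignment problem on a cost matrix. The Dice-based matching cost, the smoothing constant and the function names below are assumptions for illustration; the exact cost used for matching is not specified in this section.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dice_cost(pred_probs, target_masks, eps=1.0):
    """Pairwise Dice cost between predicted and target masks.

    pred_probs: (Q, P) predicted mask probabilities in [0, 1], flattened over P pixels.
    target_masks: (T, P) binary target masks.
    Returns a (Q, T) matrix where lower values mean better overlap.
    """
    inter = pred_probs @ target_masks.T
    denom = pred_probs.sum(axis=1)[:, None] + target_masks.sum(axis=1)[None, :]
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def match_queries_to_masks(pred_probs, target_masks):
    """One-to-one assignment of queries to target masks with minimal total cost."""
    cost = dice_cost(pred_probs, target_masks)
    query_idx, target_idx = linear_sum_assignment(cost)
    return query_idx, target_idx

# Three queries, two target masks, four pixels.
pred = np.array([[0.9, 0.8, 0.1, 0.0],
                 [0.1, 0.2, 0.9, 0.9],
                 [0.5, 0.5, 0.5, 0.5]])
targets = np.array([[0.0, 0.0, 1.0, 1.0],
                    [1.0, 1.0, 0.0, 0.0]])
query_idx, target_idx = match_queries_to_masks(pred, targets)
print(query_idx, target_idx)
```

With more queries than targets, the surplus queries stay unmatched; in this example the uninformative third query is left out.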

Figure 4: Visualization of the learned masks.
(a) Additional visualization on the sofa scene
(b) Additional visualization on the bed scene
Figure 5: Additional visualization on the sofa and bed scene.
(a) Additional visualization on the lawn scene
(b) Additional visualization on the bench scene
Figure 6: Additional visualization on the lawn and bench scene.
Figure 7: Additional visualization on the blue sofa scene.

Appendix B More Qualitative Results

We provide additional qualitative results on the remaining scenes to provide a more comprehensive demonstration.

B.1 Learned Masks

The queried mask is visualized in Fig. 4. From the figure, we observe that the mask is successfully distilled into the mask neural field.

B.2 PCA Visualization of the Learned Mask Feature Map

To explore the characteristics of the learned mask features, we visualize the rendered mask feature map by projecting it onto its first three principal components using PCA. The visualization results are displayed in Fig. 5, Fig. 6 and Fig. 7. We observe that the learned mask features delineate clear boundaries between objects, suggesting effective object decomposition through mask distillation. This observation underscores the potential for future research into object decomposition using neural fields.
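A PCA visualization of this kind can be produced as in the sketch below. It is a generic recipe, not the plotting code behind the figures: the function name, the SVD-based PCA and the per-channel min-max normalization to [0, 1] are assumptions.

```python
import numpy as np

def pca_to_rgb(feature_map):
    """Project an (H, W, D) feature map onto its top three principal components.

    Requires D >= 3 and H * W >= 3. Returns an (H, W, 3) float array in [0, 1].
    """
    height, width, dim = feature_map.shape
    x = feature_map.reshape(-1, dim).astype(np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center the features
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T
    # Rescale each component independently so it can be shown as a color channel.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(height, width, 3)

# Random stand-in for a rendered 16-dimensional mask feature map.
rng = np.random.default_rng(0)
rgb = pca_to_rgb(rng.normal(size=(4, 5, 16)))
print(rgb.shape)
```

Pixels with similar mask features map to similar colors, so sharp color transitions in the result correspond to boundaries in feature space.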

B.3 More Visualization on Different Scenes

More segmentation results are visualized in Fig. 5, Fig. 6 and Fig. 7. We consistently perform accurate segmentation.

Appendix C Limitations

Currently, MaskField depends on a predefined list of categories within the scene, and its performance may be compromised if incorrect descriptions are provided. This limitation arises because the method relies on the classification of SAM-generated masks. MaskField attempts to manage inaccurately classified masks by setting a confidence threshold to filter out unreliable masks. However, this strategy does not fully leverage the robustness against noise that is typically inherent in foundation models. This dependency on correct category descriptions and the method’s current approach to error handling exposes a vulnerability to inaccuracies in input data and potential misclassifications.

To mitigate these issues, we suggest that developing a mask-matching strategy based on features from foundation models could be promising. Feature matching could leverage the robustness inherent in these models, enhancing MaskField's ability to handle noise and inaccuracies. This approach could also facilitate training a feature field without the need for predefined categories, thus broadening the applicability and flexibility of the segmentation process. By focusing on feature-based matching rather than relying solely on category labels, this method would enable a more dynamic and adaptive segmentation framework. It could effectively harness the complex, high-dimensional feature spaces generated by foundation models, using them to identify and align similar features across different segments without explicit categorical guidance.

We identify this research direction as future work, recognizing its potential to significantly improve the robustness and versatility of MaskField. This development could lead to more sophisticated and reliable segmentation methods, particularly in open-vocabulary settings where the variability and uncertainty of inputs are considerable challenges.

Appendix D Potential Future Application and Social Impact

Scene Decomposition. An exciting avenue is the possibility of 3D scene decomposition. By leveraging the specificity of mask query tokens and class tokens, users could precisely select individual objects or groups of objects within a 3D scene. This capability would be valuable in applications ranging from virtual reality settings to interactive design and planning tools, where users need to manipulate elements dynamically. Furthermore, the ability to delete or add objects on-the-fly within a 3D scene could greatly enhance the flexibility of digital content creation, particularly in the fields of animation and game development. Such manipulations would allow creators to iterate more freely and efficiently, experimenting with different compositions and scenarios without the need for complex backend changes.

Robotics. For robots operating in complex settings like warehouses or manufacturing plants, MaskField can facilitate fast semantic awareness, enabling robots to perform tasks that require a nuanced understanding of their surroundings. This is particularly important in fast-paced settings where delays can lead to inefficiencies or accidents. Robots equipped with such semantic awareness can adapt to changes in their environment swiftly, enhancing their responsiveness and operational effectiveness.

Privacy Concerns. The deployment of real-time sophisticated monitoring and segmentation technologies in public spaces or even within private settings can lead to significant privacy concerns. There is the potential for misuse in surveillance, data breaches, or even in ways that could infringe upon personal liberties if not regulated properly.