Category-level Neural Field for Reconstruction of
Partially Observed Objects in Indoor Environment

Taekbeom Lee, Youngseok Jang, and H. ** Kim

Taekbeom Lee and H. ** Kim are with the Aerospace Engineering Department, Seoul National University, South Korea ([email protected], [email protected]). Youngseok Jang is with the Mechanical and Aerospace Engineering Department, Seoul National University, South Korea ([email protected]). This research was supported by the Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF) and the Unmanned Vehicle Advanced Research Center (UVARC), funded by the Ministry of Science and ICT, the Republic of Korea (NRF-2020M3C1C1A010864).

Abstract

Neural implicit representation has attracted attention in 3D reconstruction through various success cases. For further applications such as scene understanding or editing, several works have shown progress towards object-compositional reconstruction. Despite their superior performance in observed regions, their performance is still limited in reconstructing objects that are partially observed. To better treat this problem, we introduce category-level neural fields that learn meaningful common 3D information among objects belonging to the same category present in the scene. Our key idea is to subcategorize objects based on their observed shape for better training of the category-level model. Then we take advantage of the neural field to conduct the challenging task of registering partially observed objects by selecting and aligning against representative objects selected by ray-based uncertainty. Experiments on both simulation and real-world datasets demonstrate that our method improves the reconstruction of unobserved parts for several categories.

I INTRODUCTION

Recent approaches leveraging neural implicit representations have shown promising outcomes not only in view synthesis but also in 3D reconstruction. In particular, NeRF employs a multi-layer perceptron (MLP) to train a continuous function mapping 3D coordinates to the associated volume density and radiance. Such coordinate-MLP methods have a more compact memory footprint than explicit representations, as they encode scene information in neural network parameters. Their continuous nature offers complete reconstructions, addressing challenges such as the holes in unobserved regions often seen in classical volumetric fusion methods.

Figure 1: Our method reconstructs objects using category-level models. Objects belonging to the same category share common shape properties, which help to reconstruct unobserved parts plausibly. On the other hand, unobserved parts of objects that are reconstructed by the object-level model (vMAP) tend to over-smooth or fail to recover complete geometry.

However, the majority of coordinate-MLP methods focus on scene-level representations, leaving a notable gap in achieving object-level understanding. An object-compositional representation decomposes a scene into semantic units, i.e., objects, and this semantic composition enables applications such as scene understanding [1][2], editing [3][4], and AR/VR [5][6]. Some research has proposed NeRF-based models capable of representing both the scene and individual objects using 2D object masks as additional input. Some of them train a single MLP that represents the whole scene and use a dedicated branch to represent objects [3][7]. However, these methods are inefficient in that the entire network must be trained and evaluated to represent a single object.

Other studies train object-wise MLPs. Most of these methods are category-level: they use a separate MLP for each category to learn common characteristics, such as the shape and texture of objects in the same category [4][8][9]. The learned model acts as prior knowledge for its category to reconstruct objects that were only partially observed during training. However, these methods can only be used for a limited set of categories that have sufficiently large amounts of data. Some studies [10][11] train instance-wise MLPs and overcome this limitation. vMAP [10] models the neural fields of objects as separate neural networks trained independently and shows that it can reconstruct watertight object meshes without a prior. Panoptic Neural Field [11] also incorporates separate neural fields for individual objects and employs meta-learned initialization as a category-level prior only if a large dataset for the category exists. However, these methods do not exploit information shared within a category, and their performance in reconstructing unseen parts of objects remains insufficient without a prior.

We accordingly propose a category-level model-based method that does not use prior knowledge while utilizing common information within the category. We train a NeRF model for each category of observed objects. We do not directly use the output of semantic segmentation as each object's category, because semantic segmentation models predict an object's category mainly from semantic commonality, and even objects in the same category can differ significantly in shape. Instead, we estimate the shape similarity between each object pair and split objects with different shapes into subcategories. To learn the knowledge shared by the objects in the same subcategory, it is necessary to align the objects in 3D space. Since the observed parts vary for each partially observed object, aligning objects is a challenge in learning category-level models. To address this challenge, we select the most well-observed object as the representative of each category using a ray-based uncertainty metric and transform the other objects into its normalized object coordinate space (NOCS). Experiments on both simulated and real-world datasets show that our method can improve the reconstruction of unobserved parts of common objects in indoor scenes. In summary, the primary contributions of the paper are:

  • We propose an object-level mapping system that enhances reconstruction capabilities for unobserved parts by learning category-level knowledge solely from the observed data.

  • We propose a method that adaptively subcategorizes objects based on their observed shape, which allows the objects to share common 3D information through category-level models.

  • We propose to select a representative object for each category using ray-based uncertainty and to register the other objects to its NOCS, which addresses the challenge of learning category-level models from partially observed objects.

Figure 2: Overview of the proposed framework

II RELATED WORK

II-A Neural implicit representation for 3D reconstruction

Recent years have seen significant interest in studies that utilize neural networks to learn implicit representations for 3D shape reconstruction. They adopt a variety of representations, such as occupancy [12][13][14], signed distance functions [15][16][17], and density [18]. Of particular interest is NeRF [18], which shows impressive results in novel view synthesis. NeRF employs a multi-layer perceptron (MLP) to represent a scene as a neural radiance field comprising volume density and radiance, and leverages classical volumetric rendering. Its ability to capture geometric details, together with its capacity for complete reconstruction and its memory efficiency, has led to its application not only to single objects but also to room-level dense 3D reconstruction [19][20][21]. Some works [22][23][24][25] achieve real-time SLAM by utilizing active sampling, keyframe selection, or frameworks [26][27] that accelerate training. All of these methods represent the entire scene only; they cannot reconstruct individual objects and therefore do not support further applications such as object manipulation.

II-B Object-compositional neural implicit representation

Scene-level methods. Following the promising results obtained in 3D reconstruction with neural implicit representations, efforts have been made to acquire individual object representations using neural fields. Most studies model all objects present in a scene using a single MLP [3][7][28][29]. ObjectNeRF [3] uses an additional object branch that renders individual objects. ObjSDF [7] predicts each object's SDF from a unified branch and renders both the scene and individual objects based on this prediction. However, these methods are inefficient in both learning and rendering, as a large number of network parameters are shared, even among dissimilar objects.

Category-level methods. Category-level NeRF methods train and render various objects within a category using a single NeRF model combined with individual latent codes [8][9][30][31]. Each model learns category-level prior knowledge from the training dataset and successfully renders even in challenging situations such as few-shot scenarios at test time. CodeNeRF [8] conditions separate shape and texture codes for each instance on a shared MLP within a category, achieving disentanglement of shape and texture for individual objects. AutoRF [9] adopts an encoder-decoder model structure that encodes shape and texture from images and uses category-level NeRF as the decoder. However, these approaches require extensive training with a large number of objects within each category, limiting their application to specific categories. In contrast, our method learns category-level information existing in observed scenes for any category, eliminating the need for external datasets.

Instance-specific methods. Alternatively, some studies have employed independent neural fields for each object [10][11]. Panoptic Neural Field [11] trains a separate NeRF for the background and for each object. vMAP [10] also models each object with a neural field, independently learning each object from its own keyframe buffer to perform efficient online object mapping. However, vMAP struggles to reconstruct unseen parts of objects because it does not learn common characteristics within a category. In contrast, our approach learns shared properties of a category from different objects within the scene, enabling more effective reconstruction of occluded regions.

III METHOD

We reconstruct individual objects and composite a complete indoor scene using a sequence of $M$ posed RGB-D frames, $\mathcal{I}=\left\{I_{i}\right\}_{i=\{1,\ldots,M\}}$ and $\mathcal{D}=\left\{D_{i}\right\}_{i=\{1,\ldots,M\}}$. For each frame, high-quality instance and semantic segmentation masks are given by an off-the-shelf 2D instance segmentation network or by the dataset itself. The overview of our method is shown in Fig. 2.

III-A Preliminaries

NeRF. NeRF learns a neural implicit representation from multi-view images. Specifically, it takes posed images as input and represents a scene in terms of volume density and radiance. NeRF employs an MLP to learn an implicit function $f$ which maps a 3D point $p\in\mathbb{R}^{3}$ and a viewing direction $d\in\mathbb{R}^{3}$ to volume density $\sigma\in\mathbb{R}$ and color $c\in\mathbb{R}^{3}$ for the given scene. To render a 2D image, NeRF casts a ray $r(t)=o+td\;(t\geq 0)$ from the camera origin $o$ in the direction $d$ towards each pixel. The radiance of each pixel is approximated by integrating the radiance along the ray:

$$C(r)=\int_{0}^{\infty}T(t)\,\sigma(r(t))\,c(r(t),d)\,dt\;, \tag{1}$$

where $T(t)$ represents the accumulated transmittance along the ray, i.e., the probability that the ray travels to $t$ without colliding with any particle. This formulation can be approximated as a weighted sum through the quadrature rule [32]. Similarly, depth can also be rendered:

$$\hat{C}(r)=\sum_{i=1}^{N}w_{i}c_{i},\quad\hat{D}(r)=\sum_{i=1}^{N}w_{i}d_{i}\;, \tag{2}$$

where $w_{i}=T_{i}(1-\exp(-\sigma_{i}\delta_{i}))$, $T_{i}=\exp(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j})$, $\delta_{i}=t_{i+1}-t_{i}$, and $N$ is the number of samples along the ray.
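As an illustration of Eq. 2, the following NumPy sketch evaluates the quadrature weights and the rendered color, depth, and accumulated weight for a single ray. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `render_ray` is hypothetical, the samples are described by $N+1$ boundaries so that every $\delta_i$ is defined, and the per-sample depth $d_i$ is taken at the start of each segment.

```python
import numpy as np

def render_ray(sigma, t, color):
    """Discrete volume rendering along one ray (quadrature of Eq. 1).

    sigma: (N,) densities, t: (N+1,) segment boundaries (assumption),
    color: (N, 3) per-sample RGB values.
    """
    delta = np.diff(t)                         # delta_i = t_{i+1} - t_i
    alpha = 1.0 - np.exp(-sigma * delta)       # opacity of each segment
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): exclusive cumulative sum.
    accum = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    T = np.exp(-accum)
    w = T * alpha                              # weights w_i
    depth_samples = t[:-1]                     # d_i at segment start (assumption)
    return w, w @ color, w @ depth_samples, w.sum()

# Example: empty space followed by a dense surface starting at t = 2.
sigma = np.array([0.0, 0.0, 50.0, 50.0])
t = np.linspace(0.0, 4.0, 5)
color = np.tile([1.0, 0.5, 0.0], (4, 1))
w, C_hat, D_hat, w_sum = render_ray(sigma, t, color)
```

Because the weights telescope, their sum equals $1-\exp(-\sum_i\sigma_i\delta_i)$, so it stays in $[0,1]$ and approaches 1 only when the ray hits a surface.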

Object surface rendering. For each object given its 2D instance mask, a separate neural field is trained using rays that travel through pixels inside the 2D bounding box of the object [10]. To get a clear boundary of each foreground object, 2D opacity is rendered by summing up the termination probability $w_{i}$ at each point $x_{i}$ along the ray $r$:

$$\hat{O}(r)=\sum_{i=1}^{N}w_{i} \tag{3}$$

The object radiance field is encouraged to be occupied for pixels within the object mask and to be empty for pixels outside the object mask. To avoid learning an empty signal for occluded parts, [3][10] terminate a ray right after it hits the surface of another object. Both background and foreground objects are supervised by rendered color, depth, and opacity losses.

3D learning of object category. To learn the geometry and appearance of the category and disentangle shape and texture variations for each object in the category, CodeNeRF consists of two parts. The first part is responsible for geometry, taking a 3D point $x$ and shape code $z_{s}$ as input and producing volume density $\sigma$ and feature vector $v$. The second part is responsible for the appearance, taking $v$ and texture code $z_{t}$ as input and outputting RGB color $c$.

$$\begin{split}F_{\theta}&:F_{\theta_{s}}^{s}\circ F_{\theta_{t}}^{t}\\ F_{\theta_{s}}^{s}&:(\gamma_{x}(x),z_{s})\rightarrow(\sigma,v)\\ F_{\theta_{t}}^{t}&:(v,\gamma_{d}(d),z_{t})\rightarrow c\end{split} \tag{4}$$
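The two-part structure of Eq. 4 can be illustrated with a toy forward pass. The sketch below is only a schematic under strong assumptions: each part is reduced to a single random linear layer rather than a trained MLP, the viewing-direction input $\gamma_d(d)$ is omitted for brevity, and the class name `TinyCategoryField`, the code dimension, and the number of frequencies are hypothetical. Its purpose is to show that the shape code only affects $(\sigma, v)$, while the texture code only affects $c$.

```python
import numpy as np

def positional_encoding(x, n_freqs=4):
    """gamma_x: sin/cos features of each coordinate at n_freqs frequencies."""
    freqs = 2.0 ** np.arange(n_freqs)
    angles = x[..., None] * freqs                      # (..., 3, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)              # (..., 3 * 2 * n_freqs)

class TinyCategoryField:
    """Toy stand-in for F^s and F^t (random weights, one layer each)."""

    def __init__(self, code_dim=8, feat_dim=16, n_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        self.n_freqs = n_freqs
        in_s = 3 * 2 * n_freqs + code_dim
        self.W_s = rng.normal(0.0, 0.1, (in_s, feat_dim + 1))      # shape part
        self.W_t = rng.normal(0.0, 0.1, (feat_dim + code_dim, 3))  # texture part

    def __call__(self, x, z_s, z_t):
        n = x.shape[0]
        # Shape part: (gamma_x(x), z_s) -> (sigma, v)
        h_in = np.concatenate(
            [positional_encoding(x, self.n_freqs), np.tile(z_s, (n, 1))], axis=1)
        h = h_in @ self.W_s
        sigma = np.maximum(h[:, 0], 0.0)               # non-negative density
        v = np.tanh(h[:, 1:])
        # Texture part: (v, z_t) -> c, squashed into (0, 1)
        c_in = np.concatenate([v, np.tile(z_t, (n, 1))], axis=1)
        c = 1.0 / (1.0 + np.exp(-(c_in @ self.W_t)))
        return sigma, c

field = TinyCategoryField()
x = np.random.default_rng(1).uniform(-0.5, 0.5, (5, 3))
sigma, c = field(x, z_s=np.zeros(8), z_t=np.zeros(8))
```

In an actual category-level model, the weights are shared by all objects of the category and the codes $z_s$, $z_t$ are optimized per object.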

III-B Category-level registration

Learning the shared 3D information of objects within the same category requires aligning these objects in a 3D space accurately. Existing category-level 3D learning methods either operate exclusively on synthetic datasets [8][15], where 3D points are represented in an object-centric coordinate, or rely on pose from pretrained 3D detection networks [9][11]. Therefore it is difficult to apply these methods for categories with limited 3D data. In typical indoor settings, another challenge arises where objects may be partially observed due to occlusions or a limited number of viewpoints. This makes the alignment of objects into a common coordinate frame difficult. Moreover, objects predicted to be in the same category might have large differences in shape, making training a shared model inefficient. In this section, we address these challenges with the following strategies.

Uncertainty guided representative selection. We select a well-observed object as a representative for its category and use it in subsequent registration stages. Identifying well-observed objects in cluttered environments, especially with occlusions, is non-trivial. We tackle this by evaluating ray uncertainty in various viewing directions using the NeRF model trained for individual objects. First, we train an object-level model for each object using our batch implementation of vMAP [10]. The trained network serves as a compact memory for the observations of each object. Inspired by [33], we calculate ray uncertainty by analyzing the weight distribution predicted by the network along the ray.

Rays that traverse regions accurately learned by the object-level model have a clear peak in their weight distribution, while rays that traverse poorly learned parts have noisier peaks. The concentration of the weight distribution of ray $r$ can be quantified using the entropy $H(r)=-\sum_{i=1}^{N}w_{i}\log(w_{i})$. Additionally, from Eq. 3, the sum of weights indicates whether the ray crosses only empty space. From these properties, we define a reliability metric $g(u(r))$ as explained below. First, we design a function $u$ from the weight distribution as:

$$u(r)=\left(\sum_{i=1}^{N}w_{i}\right)\cdot\exp(-\alpha\cdot H(r))\;, \tag{5}$$

where $0\leq w_{i}\leq 1$, $H(r)\geq 0$, and $0\leq u(r)\leq 1$. Rays looking at empty regions have a low $u(r)$, accurate regions have a high $u(r)$, and uncertain regions have a medium $u(r)$. We distinguish uncertain regions using another reliability function $g$ designed as the sum of two sigmoid functions:

$$g(u)=\sigma(-\alpha_{m}(u-\beta_{m}))+\sigma(\alpha_{M}(u-\beta_{M})) \tag{6}$$

$$\alpha_{M}=\frac{2\log(\frac{\kappa}{1-\kappa})}{M_{2}-M_{1}},\;\alpha_{m}=\frac{2\log(\frac{\kappa}{1-\kappa})}{m_{2}-m_{1}},$$

$$\beta_{M}=\frac{M_{1}+M_{2}}{2},\;\beta_{m}=\frac{m_{1}+m_{2}}{2}\;.$$

Rays looking at empty regions or accurate regions have a high $g(u(r))$, and uncertain regions have a low $g(u(r))$.
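The behavior of Eq. 5 and Eq. 6 can be checked numerically. The sketch below is a minimal illustration, not the authors' code: the values of $\alpha$, $\kappa$, $m_1$, $m_2$, $M_1$, and $M_2$ are hypothetical placeholders, since the text does not state them, and a small `eps` is added inside the logarithm to handle zero weights. With the definitions of $\alpha_M$ and $\beta_M$ above, the second sigmoid evaluates to $1-\kappa$ at $u=M_1$ and to $\kappa$ at $u=M_2$; the first sigmoid mirrors this between $m_1$ and $m_2$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ray_entropy(w, eps=1e-10):
    """H(r) = -sum_i w_i log(w_i); eps avoids log(0) (assumption)."""
    return -np.sum(w * np.log(w + eps))

def u_score(w, alpha=1.0):
    """Eq. 5: accumulated weight damped by the entropy of the weights."""
    return np.sum(w) * np.exp(-alpha * ray_entropy(w))

def reliability(u, m1=0.1, m2=0.3, M1=0.6, M2=0.8, kappa=0.95):
    """Eq. 6: high for low u (empty) and high u (accurate), low in between.

    All threshold values here are hypothetical placeholders.
    """
    k = 2.0 * np.log(kappa / (1.0 - kappa))
    alpha_m, alpha_M = k / (m2 - m1), k / (M2 - M1)
    beta_m, beta_M = 0.5 * (m1 + m2), 0.5 * (M1 + M2)
    return sigmoid(-alpha_m * (u - beta_m)) + sigmoid(alpha_M * (u - beta_M))

# A sharply peaked ray, an empty ray, and a diffuse ray.
for w in (np.array([0.0, 0.0, 1.0, 0.0]), np.zeros(4), np.full(4, 0.25)):
    u = u_score(w)
    print(round(float(u), 3), round(float(reliability(u)), 3))
```

With these placeholder values, the peaked and empty rays receive a reliability close to 1, while the diffuse ray, whose $u$ falls between the two thresholds, receives a reliability close to 0.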

Given the RGB-D sequence and object masks, we acquire the observed point cloud for each object. We then cast rays uniformly from points on a spherical surface, which has a radius 1.2 times the largest dimension of the observed point cloud, to their antipodal points. In this way, we can predict uncertainty for all parts of each object as represented in Fig. 5. For each category, the object with the highest percentage of rays whose reliability is above a threshold $\eta$ is chosen as the representative.
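The ray casting and selection step can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not specify: the "largest dimension" is read as the largest axis-aligned extent of the point cloud, the sphere is centered at the bounding-box center, near-uniform points are generated with a Fibonacci lattice, and the function names and the value of $\eta$ are hypothetical.

```python
import numpy as np

def antipodal_rays(points, n_rays=256, scale=1.2):
    """Rays from points on a sphere around the object to their antipodes."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = 0.5 * (lo + hi)                    # sphere center (assumption)
    radius = scale * np.max(hi - lo)            # 1.2 x largest extent
    # Fibonacci lattice: near-uniform unit vectors (assumed sampling scheme).
    k = np.arange(n_rays) + 0.5
    z = 1.0 - 2.0 * k / n_rays
    phi = np.pi * (1.0 + np.sqrt(5.0)) * k
    s = np.sqrt(1.0 - z ** 2)
    unit = np.stack([s * np.cos(phi), s * np.sin(phi), z], axis=1)
    origins = center + radius * unit
    directions = -unit                          # through the center to the antipode
    return origins, directions, 2.0 * radius    # ray length = sphere diameter

def select_representative(reliability_per_object, eta=0.8):
    """Pick the object with the largest fraction of rays with g > eta."""
    ratios = {name: float(np.mean(np.asarray(g) > eta))
              for name, g in reliability_per_object.items()}
    return max(ratios, key=ratios.get), ratios

# Usage with made-up reliability values for two objects of one category.
best, ratios = select_representative(
    {"chair_0": [0.9, 0.95, 0.1], "chair_1": [0.9, 0.2, 0.1]})
```

In practice, the per-ray reliabilities would come from querying each object-level model along the rays returned by `antipodal_rays`.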

Registration of objects to the representative. To represent various objects within the same category in a consistent object-centric coordinate frame, we align each object to the representative of its category using a point cloud registration algorithm. We use Teaser++ [34] because it robustly aligns objects that look slightly different or are partially observed. After alignment, we adjust the oriented bounding box (OBB) according to the aligned pose, so that the bound is close to the ground truth even for partially observed objects. In the subsequent training stage, the refined OBB is used to map world coordinates to aligned coordinates in a unit cube, which is called the Normalized Object Coordinate Space (NOCS) [9].
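The mapping from world coordinates to the unit cube defined by an OBB can be written in a few lines. The sketch below assumes a cube centered at the origin with side length 1, an OBB given by a rotation matrix whose columns are the box axes in the world frame, a center, and full side lengths; the exact normalization convention used by the authors is not stated, so this is one plausible choice, and the function name is hypothetical.

```python
import numpy as np

def world_to_nocs(x_w, R, center, extent):
    """Map world points (n, 3) into the unit cube [-0.5, 0.5]^3 of an OBB.

    R: (3, 3) rotation, columns are the OBB axes expressed in world frame.
    center: (3,) OBB center. extent: (3,) full side lengths of the OBB.
    """
    # (x_w - center) @ R applies R^T to every row, i.e. world -> box axes.
    return (x_w - center) @ R / extent

# Example: a box rotated by 90 degrees about z.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
center = np.array([1.0, 2.0, 3.0])
extent = np.array([2.0, 4.0, 6.0])
corner_world = center + R @ (0.5 * extent)      # one corner of the OBB
print(world_to_nocs(corner_world[None, :], R, center, extent))
```

The OBB corner maps to a corner of the unit cube, and the OBB center maps to the origin.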

Subcategorization. It is unclear whether sharing the shape information is beneficial for the 3D learning of objects that significantly vary in shape despite belonging to the same category. Especially in scenarios like ours, where learning is confined to a limited set of observed objects without prior knowledge, articulating the advantages of training a single model becomes more challenging.

Such a strategy might consume network capacity inefficiently, deteriorating performance compared to individual training. To mitigate this, we conduct subcategorization, as shown in Fig. 3, based on a simple assumption that objects poorly aligned to each other do not share meaningful information about shape. Thanks to the robustness of our registration module against noise and partial observations, this assumption stands. For quantitative evaluation of registration, we use the unidirectional Chamfer distance $CD_{unidir}(P,Q)=\frac{1}{\left|P\right|}\sum_{p\in P}\min_{q\in Q}\left|p-q\right|$, where $P$ and $Q$ are the point cloud of the object to be aligned and the point cloud of the representative object, respectively. If the Chamfer distance is larger than the threshold, the object is assigned to a different subcategory from the representative object. We chose a unidirectional metric because $P$ is often partially observed.
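The unidirectional Chamfer distance and the resulting subcategory test can be sketched as below. This is a brute-force illustration for small point clouds; the threshold value and the function names are hypothetical, and a spatial index would be preferable for large clouds.

```python
import numpy as np

def chamfer_unidirectional(P, Q):
    """Mean distance from each point of P (n, 3) to its nearest point in Q (m, 3)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)   # (n, m)
    return d.min(axis=1).mean()

def same_subcategory(P_aligned, Q_rep, threshold=0.05):
    """An object stays with the representative unless the distance exceeds the threshold."""
    return bool(chamfer_unidirectional(P_aligned, Q_rep) <= threshold)

# A partial observation P that is a subset of the representative Q.
Q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
P = Q[:2]
print(chamfer_unidirectional(P, Q), chamfer_unidirectional(Q, P))
```

The example shows why the direction matters: the partial cloud `P` matches `Q` perfectly in the direction from `P` to `Q`, whereas the reverse direction is penalized by the parts of `Q` that were never observed in `P`.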

Figure 3: A schematic diagram of subcategorization module

III-C Model training

Given $N$ categories in the scene detected from the sequence, we employ one large model for the background and $N$ smaller models, one dedicated to each category. Each category-level model follows a structure similar to CodeNeRF. However, as the number of objects trained in each category is generally much smaller than that in CodeNeRF, we utilize a much smaller model. Since our objective leans more towards 3D reconstruction than view synthesis, we do not incorporate the viewing direction. Both the background and individual category-level models are trained from scratch. The training of our background model is the same as that of vMAP except that our method is a batch approach.

In every training iteration of the category-level model, we randomly sample training pixels gathered from the 2D bounding boxes of objects in the category. This straightforward approach is enough to let the category-level model learn shape information of all the observed parts of the objects in its category. We sample 10 points along each ray $r$ using the depth-guided sampling proposed in vMAP. Each sampled point ${}^{w}\mathrm{x}_{i}$ is mapped to ${}^{o}\mathrm{x}_{i}$ represented in NOCS using the pose and 3D bound of its object. The point is then fed into the category-level model, conditioned on the current estimate of the shape and texture codes of the object that the point belongs to. For each input point ${}^{o}\mathrm{x}_{i}$, the model outputs the occupancy probability $o_{i}$ and color $c_{i}$. The termination probability, i.e., the weight $w_{i}$, becomes $w_{i}=o_{i}\prod_{j=1}^{i-1}(1-o_{j})$. Using volume rendering, color, depth, and opacity are rendered in the form of a weighted sum. Training proceeds by object-level supervision of depth, color, opacity, and regularization [8] derived from the prior of the latent vector.
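The occupancy-based termination probability can be computed with an exclusive cumulative product. The following is a minimal NumPy sketch of that formula for one ray; the function name is hypothetical.

```python
import numpy as np

def occupancy_weights(o):
    """w_i = o_i * prod_{j<i} (1 - o_j) for occupancies o of shape (N,)."""
    # Probability that the ray is still free before reaching sample i.
    free = np.concatenate([[1.0], np.cumprod(1.0 - o)[:-1]])
    return o * free

o = np.array([0.0, 0.5, 1.0, 0.3])
w = occupancy_weights(o)          # array([0. , 0.5, 0.5, 0. ])
```

The weights sum to $1-\prod_{i}(1-o_{i})$, so a fully occupied sample absorbs all remaining probability and later samples receive zero weight.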

\begin{align}
L &= \sum_{k=1}^{K} L_{depth}^{k} + \lambda_{1} L_{color}^{k} + \lambda_{2} L_{opacity}^{k} + \lambda_{3} L_{reg}^{k}\;, \tag{7}\\
L_{depth}^{k} &= \sum_{r\in R^{k}} M^{k}(r)\left|\hat{D}(r)-D(r)\right|\;, \tag{8}\\
L_{color}^{k} &= \sum_{r\in R^{k}} M^{k}(r)\left|\hat{C}(r)-C(r)\right|\;, \tag{9}\\
L_{opacity}^{k} &= \sum_{r\in R^{k}} \left|\hat{O}(r)-M^{k}(r)\right|\;, \tag{10}\\
L_{reg}^{k} &= \left\|z_{s}^{k}\right\|_{2}^{2}+\left\|z_{t}^{k}\right\|_{2}^{2}\;, \tag{11}
\end{align}

where $k\,(=1,2,\ldots,K)$, $R^{k}$, and $M^{k}$ are the index, 2D bounding box, and 2D object mask of the object, respectively.
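The loss in Eqs. (7)-(11) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names and array layout are hypothetical, the colour error is assumed to be an L1 norm over the RGB channels, and the default weights are the values reported in the experiment setup.

```python
import numpy as np

def object_loss(d_hat, d, c_hat, c, o_hat, mask, z_s, z_t,
                lam1=5.0, lam2=10.0, lam3=0.0005):
    """Loss terms of Eqs. (8)-(11) for one object k, combined as in Eq. (7).

    d_hat, d : (N,) rendered / measured depth for the N rays sampled in R^k
    c_hat, c : (N, 3) rendered / measured colour
    o_hat    : (N,) rendered opacity
    mask     : (N,) 2D object mask M^k (1 on the object, 0 elsewhere)
    z_s, z_t : shape and texture latent codes of the object
    """
    l_depth = np.sum(mask * np.abs(d_hat - d))                   # Eq. (8)
    # Assumption: |C_hat - C| is taken as the L1 norm over the colour channels.
    l_color = np.sum(mask * np.sum(np.abs(c_hat - c), axis=-1))  # Eq. (9)
    l_opacity = np.sum(np.abs(o_hat - mask))                     # Eq. (10)
    l_reg = np.sum(z_s ** 2) + np.sum(z_t ** 2)                  # Eq. (11)
    return l_depth + lam1 * l_color + lam2 * l_opacity + lam3 * l_reg

def total_loss(per_object_inputs):
    """Eq. (7): sum the per-object losses over all K objects."""
    return sum(object_loss(**inputs) for inputs in per_object_inputs)
```

Note that the depth and colour terms are masked by $M^{k}$, so only rays that hit the object supervise its geometry and appearance, while the opacity term uses every ray in the bounding box.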

Figure 4: Reconstruction of unobserved region

IV EXPERIMENT

We evaluate our object-level mapping system both qualitatively and quantitatively on synthetic and real-world datasets, comparing it with prior state-of-the-art neural implicit scene and object reconstruction methods. We also assess the role of each component in our system through an ablation study.

IV-A Experiment setup

Implementation details. All experiments are conducted using an NVIDIA RTX A5000 GPU. We compare our method with the most closely related method, vMAP. For a fair comparison, we modify vMAP from its original online format to a batch version, denoted as vMAP*, and set the hyperparameters that determine the trade-off between quality and efficiency as follows. In every iteration, we sample $120\times\frac{n_{obj}}{n_{cls}}$ pixels for each category-level model of our method and $120$ pixels for each instance-level model of vMAP*. Both methods utilize 4-layer MLPs with hidden dimensions of 128 for the background and 32 for the foreground. Our category-level model adopts an architecture similar to CodeNeRF to incorporate shape and texture codes, each with a dimension of 32. In this way, we utilize more network parameters per model while maintaining a similar or smaller number of total parameters per test scene. We set the number of training iterations to $1\times 10^{4}$ and $1.5\times 10^{4}$ for our method and vMAP*, respectively. Both methods take 20-30 minutes for training.
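One way to read this sampling budget is sketched below. It assumes that $n_{obj}$ is the number of foreground objects in the scene and $n_{cls}$ is the number of category-level models; both readings are assumptions, since the symbols are not defined in this section. Under them, the total number of foreground pixels sampled per iteration is the same for both methods:

```latex
\underbrace{n_{cls} \times 120 \times \frac{n_{obj}}{n_{cls}}}_{\text{ours: all category-level models}}
\;=\; 120\, n_{obj} \;=\;
\underbrace{n_{obj} \times 120}_{\text{vMAP*: all instance-level models}}
```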

For training speed, we adopt vMAP’s vectorized training scheme for category models. We utilize object masks that maintain temporal consistency between frames. These are either provided by the dataset or, in the case where object masks are not provided, can be obtained by leveraging the semantic and spatial consistency mentioned in vMAP for association.

For the registration process, we employ the official Teaser++ Python code, executing under the simultaneous pose and correspondence (SPC) setting in the paper. To enhance registration robustness, the template undergoes transformation to one of 24 possible initial poses defined by the OBB that fits the template. The alignment that results in the smallest unidirectional Chamfer distance is selected. In the subcategorization process, the unidirectional Chamfer distance is normalized using the largest dimension of the template point cloud to ensure consistent thresholds across various category scales.
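The initial-pose enumeration and the selection criterion described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the template is assumed to be centered and expressed in its OBB frame, so the 24 initial poses are the 24 proper rotations that permute the OBB axes; the TEASER++ registration step itself is omitted; the direction of the unidirectional Chamfer distance (source to template) and the use of the axis-aligned extent as the "largest dimension" are assumptions; and all function names are hypothetical.

```python
import itertools
import numpy as np

def axis_aligned_rotations():
    """Return the 24 proper rotations that map the coordinate axes onto themselves."""
    rotations = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1.0, -1.0), repeat=3):
            rot = np.zeros((3, 3))
            for row, (col, sign) in enumerate(zip(perm, signs)):
                rot[row, col] = sign
            if np.isclose(np.linalg.det(rot), 1.0):  # keep rotations, drop reflections
                rotations.append(rot)
    return rotations

def unidirectional_chamfer(source, target):
    """Mean distance from each source point to its nearest target point (brute force)."""
    diff = source[:, None, :] - target[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1).mean()

def normalized_chamfer(source, template):
    """Unidirectional Chamfer distance divided by the largest extent of the template."""
    scale = (template.max(axis=0) - template.min(axis=0)).max()
    return unidirectional_chamfer(source, template) / scale

def best_initial_rotation(source, template):
    """Score every candidate rotation of the template and return the best one."""
    rotations = axis_aligned_rotations()
    scores = [unidirectional_chamfer(source, template @ rot.T) for rot in rotations]
    best = int(np.argmin(scores))
    return rotations[best], scores[best]
```

In the full pipeline each candidate would first be refined by the registration solver before scoring; here the candidates are scored directly to keep the example self-contained. The normalized distance is scale-free, which is what allows a single threshold to be shared across categories of different sizes.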

Figure 5: Visualization of the proposed reliability metric. The color at each spherical surface point indicates the $u(r)$ and $g(u)$ values of the point's corresponding ray. Both plots are oriented the same as the object in the scene.

All experiments use the same hyperparameters. The threshold for representative selection, $\eta$, is set to 0.5. For the reliability metric, the three types of regions (i.e., well observed, unobserved, and empty) occupy different ranges of $u(r)$, as shown in Fig. 5, so we set $m_{1}=0.1$, $m_{2}=0.15$, $M_{1}=0.57$, $M_{2}=0.65$, and $\kappa_{1}=0.9$ following the transition regions of $u(r)$. The threshold for subcategorization, $\eta$, is set to 0.12. Note that we choose this value conservatively low to avoid the severe case in which very different objects are assigned to the same subcategory, which would damage their reconstruction. The loss weights are set to $\lambda_{1}=5$, $\lambda_{2}=10$, and $\lambda_{3}=0.0005$.

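TABLE I: Object-level reconstruction results on 8 Replica scenes (Acc. and Comp. in cm, C.R. in %)

| Method | Metric | room-0 | room-1 | room-2 | office-0 | office-1 | office-2 | office-3 | office-4 |
|---|---|---|---|---|---|---|---|---|---|
| TSDF-Fusion* | Acc. | 2.80 | 3.00 | 4.10 | 3.19 | 2.44 | 2.94 | 3.21 | 3.75 |
| TSDF-Fusion* | Comp. | 3.79 | 5.06 | 4.23 | 3.08 | 4.45 | 4.17 | 3.63 | 3.65 |
| TSDF-Fusion* | C.R. | 77.22 | 73.36 | 77.75 | 83.45 | 80.48 | 81.45 | 80.26 | 80.68 |
| iMAP* | Acc. | 3.16 | 3.30 | 4.36 | 3.48 | 2.64 | 3.12 | 3.68 | 4.06 |
| iMAP* | Comp. | 2.79 | 4.16 | 3.89 | 3.31 | 3.32 | 3.13 | 3.64 | 4.36 |
| iMAP* | C.R. | 84.17 | 76.62 | 79.37 | 82.08 | 84.24 | 82.15 | 80.65 | 77.11 |
| ObjectSDF++* | Acc. | 2.27 | 3.99 | 2.34 | 2.35 | 2.70 | 2.32 | 2.28 | 2.19 |
| ObjectSDF++* | Comp. | 3.29 | 5.51 | 4.42 | 3.13 | 3.75 | 3.72 | 3.65 | 6.36 |
| ObjectSDF++* | C.R. | 83.53 | 71.13 | 79.21 | 84.68 | 81.18 | 83.57 | 83.13 | 74.60 |
| vMAP* | Acc. | 1.98 | 2.73 | 2.19 | 2.17 | 2.19 | 2.16 | 2.28 | 2.08 |
| vMAP* | Comp. | 1.70 | 2.55 | 2.61 | 1.23 | 3.18 | 2.05 | 2.08 | 2.11 |
| vMAP* | C.R. | 94.80 | 90.54 | 86.83 | 95.95 | 90.08 | 93.97 | 92.28 | 90.57 |
| Ours | Acc. | 2.21 | 3.16 | 2.15 | 2.26 | 2.22 | 2.13 | 2.23 | 2.08 |
| Ours | Comp. | 1.67 | 2.24 | 1.92 | 1.22 | 3.15 | 1.89 | 1.33 | 1.83 |
| Ours | C.R. | 94.83 | 92.26 | 89.67 | 96.01 | 90.31 | 94.64 | 95.89 | 92.57 |
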
Figure 6: Reconstruction results of Replica scenes. Especially when similar objects are present, our method better reconstructs unseen parts for various categories (ottoman, table, nightstand, lamp) compared to baselines.

Datasets. We evaluate the proposed system on the Replica [35] and ScanNet [36] datasets. Replica is a synthetic dataset that provides RGB images along with ground-truth (gt) poses, gt depths, and gt object/semantic masks. ScanNet is a real-world dataset that provides RGB images accompanied by gt poses, noisy depths, and noisy object/semantic masks. In the experiments on Replica, we use 2000 frames of the 8 room-scale scenes utilized in iMAP [22]. From the ScanNet dataset, we select 6 scenes. It should be noted that object point clouds and bounds derived from ScanNet's noisy object masks are inaccurate due to depth discontinuities at object boundaries. Thus, as preprocessing, we generate semantically refined depth segmentation masks, as proposed in [37], and use them instead of the noisy object masks.

Metrics. For both datasets and the ablation studies, we employ the same object-level metrics as vMAP, except for the completion ratio (<1 cm %): accuracy [cm] (denoted Acc.), completion [cm] (denoted Comp.), and completion ratio (<5 cm %) (denoted C.R.). Since accuracy measures the average error of the reconstructed parts, inferring details of unseen parts from category-level shape information might incur a metric penalty. Therefore, for a fairer comparison, we evaluate accuracy on the same part of each method's resulting mesh by cropping it with the 3D OBB fitted to the vMAP mesh.
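The three metrics can be sketched as follows, assuming the definitions commonly used for these quantities in the neural mapping literature: accuracy is the mean distance from reconstructed points to their nearest ground-truth points, completion is the mean distance in the opposite direction, and completion ratio is the percentage of ground-truth points with a reconstructed point within the threshold. The function names are hypothetical, point sets are assumed to be sampled from the meshes in centimetres, and the OBB cropping step is not included.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst (brute force)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

def reconstruction_metrics(pred_pts, gt_pts, threshold_cm=5.0):
    """Return (accuracy [cm], completion [cm], completion ratio [%]).

    pred_pts : (N, 3) points sampled from the reconstructed mesh, in cm
    gt_pts   : (M, 3) points sampled from the ground-truth mesh, in cm
    """
    accuracy = nearest_distances(pred_pts, gt_pts).mean()
    gt_to_pred = nearest_distances(gt_pts, pred_pts)
    completion = gt_to_pred.mean()
    completion_ratio = 100.0 * (gt_to_pred < threshold_cm).mean()
    return accuracy, completion, completion_ratio
```

Under these definitions, geometry hallucinated in unseen regions can only worsen accuracy (it adds reconstructed points that may be far from the ground truth) while it can improve completion, which is why accuracy is evaluated on a cropped region.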

IV-B Case Study

To understand how the category-level model and subcategorization in our method improve the reconstruction of unobserved parts of objects, we test the reconstruction of unobserved parts using the environment in Fig. 4. To focus on the impact of the proposed modules, we use the ground-truth pose for each object. The test environment consists of two significantly different types of chair instances, where different parts of each instance are observed. For each method, two selected instances from each chair type are visualized with the camera frames used for training the model that is responsible for predicting the shape of the instance. Each frame corresponds to a different instance and is the frame in the sequence that observes that instance best. vMAP cannot reconstruct the seat of the instance (Fig. 4(a)) and oversmooths the unobserved part (Fig. 4(b)). The category-level model without subcategorization reconstructs the unobserved part in Fig. 4(a), but reconstructs it incorrectly in Fig. 4(b) by generating the leg from the shape information of instances of a different type. Only our method accurately reconstructs the unobserved part, since it draws additional information only from chair instances of the same type.

In Fig. 5, we also visualize the reliability metric to check whether it behaves as intended. The result matches our expectations: regions observed as empty and well-observed regions have high $g(u(r))$ values, whereas unobserved regions have low $g(u(r))$ values.

IV-C Evaluation on Scene and Object Reconstruction

Results on Replica dataset. Table I shows the object-level reconstruction results. Our method is compared with scene-level methods (i.e., TSDF-Fusion and iMAP) and object-level methods (i.e., ObjectSDF++ and vMAP). Since these methods, except ObjectSDF++ [29], are online methods, we implement their batch mode using their open-source code and train them using the ground-truth pose (denoted with “*”) for a fair comparison. Since ObjectSDF++ uses the predicted depth from [38], we re-implement it to use the ground-truth depth and replace its scale-invariant depth loss with the usual scale-aware depth loss. We train the re-implemented ObjectSDF++ with the same number of iterations as the original ($2\times 10^{5}$ iterations, 18 hours).

Our method outperforms the baselines in object-level completion and completion ratio for all scenes and shows comparable or better accuracy than the baselines. Ours and vMAP* achieve significantly higher completeness than ObjectSDF++ because they fill holes in unobserved regions better. This is because they explicitly avoid providing training information about occluded regions to their models, whereas ObjectSDF++ does not. Our method achieves better completeness than vMAP* since it reconstructs unobserved parts more accurately using category-level information. Notably, in scenes like room-2 and office-3, where there are many occlusions and several objects of the same shape, our method improves object-level completion by 20-30% compared to vMAP*. Fig. 6 shows the reconstruction results of 2 Replica scenes. Highlighted boxes mark parts that are not visible in the given sequence, and our method outperforms the baselines in reconstructing the details of these parts.

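TABLE II: Object-level reconstruction results on ScanNet.

| Method | Metric | scene 0013 | scene 0024 | scene 0059 | scene 0066 | scene 0084 | scene 0281 |
|---|---|---|---|---|---|---|---|
| vMAP* | Acc. | 4.50 | 3.74 | 4.53 | 4.44 | 2.91 | 4.98 |
| vMAP* | Comp. | 3.62 | 3.98 | 4.57 | 4.50 | 4.58 | 4.09 |
| vMAP* | C.R. | 89.05 | 84.63 | 80.05 | 81.33 | 86.11 | 84.20 |
| Ours | Acc. | 3.61 | 3.43 | 4.23 | 4.10 | 2.82 | 4.56 |
| Ours | Comp. | 3.58 | 4.19 | 4.31 | 4.13 | 4.06 | 3.84 |
| Ours | C.R. | 89.53 | 84.07 | 81.29 | 82.09 | 86.79 | 85.43 |
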
Figure 7: Object-level reconstruction results on ScanNet.

The bottom row shows examples of frames that capture the corresponding object.

Results on ScanNet dataset. To highlight the strengths and practicality of our system, we conduct experiments on 6 scenes from the real-world ScanNet dataset: 4 scenes (scene 0013/0059/0066/0281) that contain several similar objects and 2 scenes that contain only a small number of similarly shaped objects. Table II compares the reconstruction results of our approach with vMAP* for all foreground objects in the ScanNet scenes. Our method outperforms vMAP* in all metrics for the 4 scenes with multiple similar objects and shows comparable performance to vMAP* in the 2 scenes with few similar objects. As shown in Fig. 7, our method reconstructs the objects more completely.

Category-wise results. To clarify the strength of our method, Table III compares it with vMAP* for objects in certain categories. In both datasets, our method demonstrates significant improvements over vMAP* for multiple objects that belong to the same class. Even for objects without similar ones in the scene, which are not our target, our method performs at least similarly to vMAP*.

TABLE III: Object-level reconstruction results for selected classes. room-2 contains multiple chairs and a single box, and scene0066 contains multiple chairs and a single table.

| Method | Metric | room-2 chair | room-2 box | scene0066 chair | scene0066 table |
|---|---|---|---|---|---|
| vMAP* | Acc. | 3.67 | 0.87 | 4.96 | 5.41 |
| vMAP* | Comp. | 4.83 | 0.89 | 6.89 | 1.99 |
| vMAP* | C.R. | 78.82 | 100 | 70.65 | 96.90 |
| Ours | Acc. | 3.42 | 0.81 | 4.58 | 5.44 |
| Ours | Comp. | 2.03 | 0.81 | 5.77 | 2.02 |
| Ours | C.R. | 94.48 | 100 | 73.56 | 96.79 |

TABLE IV: Object-level reconstruction results of ablation study. The performances of representative selection and subcategorization methods are computed using the categories with more than 5 and 2 objects included, respectively.

| Component | Setting | Acc. | Comp. | C.R. |
|---|---|---|---|---|
| Representative | Random | 1.83 | 2.17 | 91.59 |
| Representative | Ours | 1.82 | 1.87 | 92.70 |
| Subcategorization | No | 2.42 | 2.24 | 91.83 |
| Subcategorization | Ours | 2.14 | 2.07 | 92.77 |

Figure 8: Category-level registration result. OBB, 6DoF pose, and subcategory for each chair instance are visualized. Note that the chair class is colored by two different colors, which means that the chair class is divided into two subcategories.

IV-D Ablation study

We compare the 3D reconstruction performance of the proposed uncertainty-guided method with that of simple random selection for selecting representative objects. Table IV indicates that, especially in scenarios with many objects belonging to the same category, where choosing an appropriate representative is vital, both accuracy and completion are superior when the proposed method is utilized. Table IV also shows that both accuracy and completion are superior when subcategorization is applied compared to when it is not.

In addition, Fig. 8 displays the estimated poses and bounds of the observed objects that belong to a selected category (chair), along with the subcategorization outcomes, as the result of category-level registration. Our method can estimate the bounds and poses of chairs that are occluded by a table consistently with other objects in the same category. We also compare the memory usage with vMAP*. Our single category-level model has 18179 parameters, whereas vMAP*'s single instance-level model has 11363 parameters. Note that our model can learn all objects in the category using a smaller amount of memory than vMAP*'s several instance-level models, making our system much more memory-efficient in scenes with many objects in the same category.
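The memory comparison can be made concrete with a short calculation. The sketch below uses the two parameter counts reported above and additionally assumes that each object in a category carries its own 32-dimensional shape code and 32-dimensional texture code on top of the shared category-level model; whether the reported 18179 already includes such codes is not stated, so this is a conservative assumption, and the function name is hypothetical.

```python
def parameter_counts(n_objects, category_params=18179,
                     instance_params=11363, code_dim=32):
    """Return (ours, vmap) parameter totals for one category with n_objects objects.

    ours : one shared category-level model plus per-object shape and texture codes
           (assumption: codes are counted separately from the 18179 parameters)
    vmap : one instance-level model per object
    """
    ours = category_params + n_objects * 2 * code_dim
    vmap = n_objects * instance_params
    return ours, vmap

for n in (1, 2, 5):
    ours, vmap = parameter_counts(n)
    print(f"{n} object(s): ours = {ours}, vMAP* = {vmap}")
```

Under these assumptions the category-level model already uses fewer parameters than the instance-level models once a category contains two objects, and the gap grows roughly linearly with the number of objects.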

V CONCLUSIONS

We proposed an object-level mapping system with category-level neural fields. The objective of our system is to enhance the reconstruction of unobserved regions using category-level models without any information learned from external data. To achieve this, we introduce a category-level registration module that maps objects in the same category to the same normalized object-centric space and subcategorizes objects so that the category-level model learns only from objects that share meaningful shape information. The model then learns neural fields with independent shape and appearance components for each object. Experiments on synthetic and real-world datasets show that the shape information of objects with similar shapes can be successfully leveraged to reconstruct a more complete mesh.

References

  • [1] Y. Nie, X. Han, S. Guo, Y. Zheng, J. Chang, and J. J. Zhang, “Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 55–64.
  • [2] K. Li, H. Rezatofighi, and I. Reid, “Moltr: Multiple object localization, tracking and reconstruction from monocular rgb videos,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3341–3348, 2021.
  • [3] B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, G. Zhang, and Z. Cui, “Learning object-compositional neural radiance field for editable scene rendering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13 779–13 788.
  • [4] J. Ost, F. Mannan, N. Thuerey, J. Knodt, and F. Heide, “Neural scene graphs for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2856–2865.
  • [5] H.-X. Yu, L. J. Guibas, and J. Wu, “Unsupervised discovery of object radiance fields,” arXiv preprint arXiv:2107.07905, 2021.
  • [6] Y. Wu, Y. Zhang, D. Zhu, Z. Deng, W. Sun, X. Chen, and J. Zhang, “An object slam framework for association, mapping, and high-level tasks,” IEEE Transactions on Robotics, 2023.
  • [7] Q. Wu, X. Liu, Y. Chen, K. Li, C. Zheng, J. Cai, and J. Zheng, “Object-compositional neural implicit surfaces,” in European Conference on Computer Vision.   Springer, 2022, pp. 197–213.
  • [8] W. Jang and L. Agapito, “Codenerf: Disentangled neural radiance fields for object categories,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12 949–12 958.
  • [9] N. Müller, A. Simonelli, L. Porzi, S. R. Bulo, M. Nießner, and P. Kontschieder, “Autorf: Learning 3d object radiance fields from single view observations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3971–3980.
  • [10] X. Kong, S. Liu, M. Taher, and A. J. Davison, “vmap: Vectorised object mapping for neural field slam,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 952–961.
  • [11] A. Kundu, K. Genova, X. Yin, A. Fathi, C. Pantofaru, L. J. Guibas, A. Tagliasacchi, F. Dellaert, and T. Funkhouser, “Panoptic neural fields: A semantic object-aware neural scene representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 871–12 881.
  • [12] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger, “Occupancy networks: Learning 3d reconstruction in function space,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4460–4470.
  • [13] S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger, “Convolutional occupancy networks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16.   Springer, 2020, pp. 523–540.
  • [14] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3504–3515.
  • [15] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, “Deepsdf: Learning continuous signed distance functions for shape representation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 165–174.
  • [16] L. Yariv, J. Gu, Y. Kasten, and Y. Lipman, “Volume rendering of neural implicit surfaces,” Advances in Neural Information Processing Systems, vol. 34, pp. 4805–4815, 2021.
  • [17] P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” arXiv preprint arXiv:2106.10689, 2021.
  • [18] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” in European Conference on Computer Vision.   Springer, 2020, pp. 405–421.
  • [19] Z. Yu, S. Peng, M. Niemeyer, T. Sattler, and A. Geiger, “Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction,” Advances in neural information processing systems, vol. 35, pp. 25 018–25 032, 2022.
  • [20] D. Azinović, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, “Neural rgb-d surface reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6290–6301.
  • [21] J. Wang, T. Bleja, and L. Agapito, “Go-surf: Neural feature grid optimization for fast, high-fidelity rgb-d surface reconstruction,” in 2022 International Conference on 3D Vision (3DV).   IEEE, 2022, pp. 433–442.
  • [22] E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6229–6238.
  • [23] Z. Zhu, S. Peng, V. Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “Nice-slam: Neural implicit scalable encoding for slam,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 786–12 796.
  • [24] A. Rosinol, J. J. Leonard, and L. Carlone, “Nerf-slam: Real-time dense monocular slam with neural radiance fields,” arXiv preprint arXiv:2210.13641, 2022.
  • [25] H. Wang, J. Wang, and L. Agapito, “Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 293–13 302.
  • [26] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics (ToG), vol. 41, no. 4, pp. 1–15, 2022.
  • [27] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in European Conference on Computer Vision.   Springer, 2022, pp. 333–350.
  • [28] Z. Li, X. Lyu, Y. Ding, M. Wang, Y. Liao, and Y. Liu, “Rico: Regularizing the unobservable for indoor compositional reconstruction,” arXiv preprint arXiv:2303.08605, 2023.
  • [29] Q. Wu, K. Wang, K. Li, J. Zheng, and J. Cai, “Objectsdf++: Improved object-compositional neural implicit surfaces,” arXiv preprint arXiv:2308.07868, 2023.
  • [30] P. Henzler, J. Reizenstein, P. Labatut, R. Shapovalov, T. Ritschel, A. Vedaldi, and D. Novotny, “Unsupervised learning of 3d object categories from videos in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4700–4709.
  • [31] C. Xie, K. Park, R. Martin-Brualla, and M. Brown, “Fig-nerf: Figure-ground neural radiance fields for 3d object category modelling,” in 2021 International Conference on 3D Vision (3DV).   IEEE, 2021, pp. 962–971.
  • [32] N. Max, “Optical models for direct volume rendering,” IEEE Transactions on Visualization and Computer Graphics, vol. 1, no. 2, pp. 99–108, 1995.
  • [33] S. Lee, L. Chen, J. Wang, A. Liniger, S. Kumar, and F. Yu, “Uncertainty guided policy for active robotic 3d reconstruction using neural radiance fields,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 12 070–12 077, 2022.
  • [34] H. Yang, J. Shi, and L. Carlone, “Teaser: Fast and certifiable point cloud registration,” IEEE Transactions on Robotics, vol. 37, no. 2, pp. 314–333, 2020.
  • [35] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, et al., “The replica dataset: A digital replica of indoor spaces,” arXiv preprint arXiv:1906.05797, 2019.
  • [36] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
  • [37] M. Grinvald, F. Furrer, T. Novkovic, J. J. Chung, C. Cadena, R. Siegwart, and J. Nieto, “Volumetric instance-aware semantic mapping and 3d object discovery,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3037–3044, 2019.