GenS: Generalizable Neural Surface Reconstruction from Multi-View Images

Rui Peng1,2  Xiaodong Gu3  Luyang Tang1  Shihe Shen1  Fanqi Yu1  Ronggang Wang✉,1,2
1School of Electronic and Computer Engineering, Peking University
2Peng Cheng Laboratory  3Alibaba Group
Abstract

Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods require lengthy per-scene optimization and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful and can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency to impose multi-view consistency in a more discriminative multi-scale feature space, which is robust to the failures of photometric consistency. The learnable features can also be self-enhanced to continuously improve the matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss that makes the model robust to regions covered by few viewpoints by distilling the geometric prior from dense inputs to sparse inputs. Extensive experiments on popular benchmarks show that our model generalizes well to new scenes and outperforms existing state-of-the-art methods, even those employing ground-truth depth supervision. Code is available at https://github.com/prstrive/GenS.

1 Introduction

Surface reconstruction from multi-view images is a cornerstone task in computer vision with many applications in virtual reality, autonomous driving, robotics, etc. Typical solutions kazhdan2006poisson ; kazhdan2013screened ; galliani2015massively ; schonberger2016structure ; yao2018mvsnet ; xu2019multi ; gu2020cascade ; peng2022rethinking in the past were mostly based on a multi-step pipeline that includes depth estimation, depth fusion and meshing. Although they have demonstrated excellent performance, the procedure is cumbersome and inevitably introduces cumulative errors. While several early works niemeyer2020differentiable ; yariv2020multiview used differentiable surface rendering to directly reconstruct surfaces, recent works oechsle2021unisurf ; wang2021neus ; yariv2021volume , inspired by the huge success of the neural radiance field (NeRF) mildenhall2020nerf in synthesizing novel views, adopt volume rendering max1995optical and represent the 3D geometry with an occupancy field mescheder2019occupancy or a signed distance function (SDF) park2019deepsdf , achieving more impressive results.

The key idea of these approaches is to train a compact multi-layer perceptron (MLP) to predict the implicit representation (e.g., SDF) of each sampled point on camera rays. The density used in volume rendering is then regarded as a function of this implicit representation, and alpha-composition of samples is performed to produce the corresponding pixel color and geometry information. However, existing methods are hampered by a lengthy per-scene optimization procedure and cannot generalize to new scenes, which makes them infeasible for many application scenarios. A recent method long2022sparseneus attempts to address these issues by conditioning the SDF-induced model on features extracted from sparse nearby views. Nevertheless, its accuracy is limited by overly smooth reconstruction, and the multi-stage process it relies on is prone to introducing cumulative errors. In this paper, we seek to establish an end-to-end generalizable model that can efficiently infer finer 3D structure. Compared with existing methods wang2021neus ; yariv2021volume , such a generalizable system faces more challenging problems. First, it is non-trivial to represent the scene efficiently. Previous methods murez2020atlas ; long2022sparseneus ; chen2021mvsnerf ; wang2021ibrnet either build a global volume or employ feature projections, but these have proven to be either lacking in detail or unsuitable for view-independent surface reconstruction. Second, it is difficult to reconstruct compact geometry by relying only on the rendering loss, since multi-view consistency is ignored. We also found that ordinary photometric consistency cannot effectively solve this problem for our generalizable model because of ambiguous areas such as low-texture and reflective regions. Last but not least, since generalizable models rely heavily on aggregation quality, inferring smooth geometry from sparse inputs is a thorny issue.

Figure 1: Qualitative comparisons on DTU and BlendedMVS datasets with sparse inputs.

To this end, we introduce GenS to tackle these challenges. The main ideas behind it are as follows: 1) We first construct a generalized multi-scale volume to represent the scene, which preserves global smoothness through the low-resolution volumes and recovers geometric details from the high-resolution volumes. Meanwhile, low-dimensional features make our model more lightweight than a single large-width volume. 2) We introduce the multi-scale feature-metric consistency, which enforces multi-view consistency in the multi-scale feature space, to replace the common photometric consistency. Compared with the original image space, learnable multi-scale features provide a more discriminative representation, and the feature space can be self-enhanced during generalization training to continuously improve the matching accuracy. 3) Inspired by the fact that reconstruction with dense inputs is more accurate, we propose a view contrast loss that forces the model to better perceive the geometry of regions visible from few viewpoints by using the reconstruction from dense inputs to teach the reconstruction from sparse inputs.

To demonstrate the quantitative and qualitative effectiveness of GenS, we conduct extensive experiments on the DTU jensen2014large and BlendedMVS yao2020blendedmvs datasets. Results show that our model outperforms the existing state-of-the-art generalizable method long2022sparseneus , and even the recent method ren2022volrecon , which adopts ground-truth depth for supervision. Compared with the per-scene overfitting methods wang2021neus ; yariv2021volume ; yariv2020multiview ; oechsle2021unisurf ; long2022sparseneus , we also achieve comparable or superior results with dense inputs. Some comparisons are shown in Fig. 1. In summary, our main contributions are highlighted below: a) We present a powerful representation based on our generalized multi-scale volume, which can efficiently reconstruct smooth and detailed surfaces from multi-view images. b) We introduce a more discriminative multi-scale feature-metric consistency to successfully constrain the geometry, which helps the generalization model converge to the optimum. c) We propose a view contrast loss to improve the geometric smoothness and accuracy when the number of visible viewpoints is limited. d) Our model can be trained end-to-end and achieves state-of-the-art reconstructions in both the generic setting and the per-scene optimization setting.

Figure 2: Illustration of GenS. We first extract multi-scale features with an FPN network. The generalized multi-scale volume is then constructed from the features at the corresponding scale. We employ the same blending strategy as wang2021ibrnet to estimate the appearance of each point on a ray, and adopt volume rendering to recover the color of a pixel. We design the multi-scale feature-metric consistency to constrain the geometry, as shown in the top right. For convenience, we omit some losses that will be detailed later.

2 Related work

Classical multi-view reconstruction.

Reconstructing 3D geometry from multi-view images is a longstanding problem in 3D vision. Classical algorithms mainly adopt a depth-based or voxel-based methodology. Multi-view stereo (MVS) is a typical class of depth-based methods, which takes stereo correspondence across multiple images as the main cue to reconstruct depth maps. While traditional MVS methods barnes2009patchmatch ; schonberger2016structure ; galliani2015massively ; furukawa2009accurate ; xu2019multi ; schonberger2016pixelwise relied on hand-crafted similarity metrics, many recent learning-based methods yao2018mvsnet ; gu2020cascade ; wang2021patchmatchnet ; peng2022rethinking apply deep learning to achieve more discriminative matching. These methods go through complicated procedures to retrieve the surface, including depth filtering, fusion and meshing kazhdan2006poisson ; bernardini1999ball , and are prone to cumulative errors. On the other hand, voxel-based methods seitz1999photorealistic ; kutulakos2000theory ; izadi2011kinectfusion ; niessner2013real directly model objects in a volume, but they are limited by memory consumption, a common drawback of volumetric representations, and cannot achieve high accuracy.

Neural surface.

Owing to their ability to achieve high spatial resolution, neural implicit functions have recently gained a lot of attention and have emerged as an effective representation of 3D geometry takikawa2021neural ; mescheder2019occupancy ; park2019deepsdf ; genova2019learning ; michalkiewicz2019implicit ; niemeyer2019occupancy ; peng2020convolutional ; saito2019pifu and appearance liu2020neural ; mildenhall2020nerf ; liu2020dist ; oechsle2019texture ; sun2022direct ; yu2021plenoctrees ; pumarola2021d ; muller2022instant . Furthermore, some works mildenhall2020nerf ; liu2019learning ; niemeyer2020differentiable have proposed to train models without 3D supervision via differentiable rendering, e.g., surface rendering and volume rendering. Methods that adopt surface rendering niemeyer2020differentiable ; yariv2020multiview ; zhang2021learning only consider a single surface intersection point for each ray and fail to reconstruct complex objects; they are also restricted by the need for accurate object masks and careful weight initialization. In contrast, recent methods use volume rendering oechsle2021unisurf ; wang2021neus ; yariv2021volume ; yu2022monosdf to take multiple points along the ray into consideration and achieve more impressive results. However, both types of methods require expensive per-scene optimization and cannot generalize to new scenes.

Generalizable neural surface.

In the field of novel view synthesis, some methods chen2021mvsnerf ; wang2021ibrnet ; yu2021pixelnerf ; johari2022geonerf have successfully introduced generalization into rendering methods. However, these methods suffer from the same problem as NeRF: the geometry is ambiguous. Few works have focused on generalizable neural surface reconstruction. A recent study, SparseNeuS long2022sparseneus , is the first attempt to achieve this by reconstructing the surface from nearby viewpoints in a multi-stage manner. Nevertheless, its reconstruction lacks details and, as in classical 3D reconstruction, the multi-stage pipeline may accumulate errors at each stage. In contrast, our model can be trained end-to-end and reconstructs smoother and more refined geometry.

3 Method

Given $N$ posed images of an object taken from different viewpoints, our goal is to reconstruct the surface as an implicit function without expensive per-scene optimization, or with only fast fine-tuning. Our overall framework is depicted in Fig. 2. We first introduce how to infer the geometry and appearance from the generalized multi-scale volume in Sec. 3.1, then elaborate on the necessity and implementation of the multi-scale feature-metric consistency in Sec. 3.2, and finally detail the realization of the view contrast loss in Sec. 3.3 and the overall pipeline in Sec. 3.4.

3.1 Geometry and color reasoning from the generalized multi-scale volume

Compared with the existing solution long2022sparseneus , which relies on a single volume and a multi-stage strategy, our approach has three main advantages. First, our generalized multi-scale volume is a more powerful representation, which implicitly decouples geometry into base structures in the low-resolution volumes and high-frequency details in the high-resolution volumes. Second, with low-dimensional features, we can construct multi-scale volumes with higher resolution and less memory consumption than a single large-width volume. Third, our model can be trained end-to-end, avoiding cumulative errors.

Generalized multi-scale volume construction.

Suppose there are $N$ images $\{I_i\}_{i=0}^{N-1}$ of an object. We first apply the FPN network lin2017feature , with weights shared across all images, to extract multi-scale feature maps $\{F_i^j\}_{i,j=0,0}^{N-1,L-1}$, and then construct a separate volume from the features at each scale. In this paper, we define the bounding box of interest in the reference frustum like long2022sparseneus , and in the world coordinate system like wang2021neus ; zhang2020nerf++ when dense inputs are available. We adopt a combination of $L$ volumes $\{V_j\}_{j=0}^{L-1}$, which cover the same region but have different resolutions $Ch \times \frac{D}{2^j} \times \frac{H}{2^j} \times \frac{W}{2^j}$.

Here, we discuss the first scale and omit the scale subscript $j$ for convenience. Given camera intrinsics $\{K_i\}_{i=0}^{N-1}$ and extrinsics $\{[R, t]_i\}_{i=0}^{N-1}$, we first project the voxel $v = (x, y, z)$ onto the pixel position in viewpoint $i$:

$$q_i(v) = \pi(K_i R_i^T (v - t_i)), \tag{1}$$

where $\pi((x, y, z)^T) = (\frac{x}{z}, \frac{y}{z})^T$ is an operator that converts homogeneous coordinates to Cartesian coordinates. We can then obtain the feature of each voxel in the $i$-th viewpoint through bilinear interpolation, $F_i(v) = F_i\langle q_i(v) \rangle$. To fuse the features from all viewpoints $\{F_i(v)\}_{i=0}^{N-1}$, we adopt the same aggregation strategy as wang2021ibrnet to generate the cost volume, concatenating the mean and variance to simultaneously capture statistical and semantic information: $B(v) = [Mean(v), Var(v)]$.
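
The projection of Eq. (1) and the mean/variance aggregation can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the helper names `project`, `bilinear` and `aggregate` are hypothetical, feature maps are assumed to be stored as `(C, H, W)` arrays with at least two rows and columns, and out-of-view projections are simply clamped to the image border (an assumption made here for brevity).

```python
import numpy as np

def project(v, K, R, t):
    """Eq. (1): q = pi(K R^T (v - t)) for voxel centers v of shape (P, 3)."""
    cam = (v - t) @ R                 # row-vector form of R^T (v - t)
    hom = cam @ K.T                   # apply the intrinsics
    return hom[:, :2] / hom[:, 2:3]   # pi: divide by the third coordinate

def bilinear(feat, q):
    """Sample a (C, H, W) feature map at pixel positions q of shape (P, 2)."""
    C, H, W = feat.shape
    x = np.clip(q[:, 0], 0, W - 1)    # assumption: clamp to the border
    y = np.clip(q[:, 1], 0, H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    ax, ay = x - x0, y - y0           # fractional offsets inside the cell
    top = feat[:, y0, x0] * (1 - ax) + feat[:, y0, x0 + 1] * ax
    bot = feat[:, y0 + 1, x0] * (1 - ax) + feat[:, y0 + 1, x0 + 1] * ax
    return (top * (1 - ay) + bot * ay).T          # (P, C)

def aggregate(voxels, feats, Ks, Rs, ts):
    """B(v) = [Mean(v), Var(v)] over N views; returns an array of shape (P, 2C)."""
    per_view = np.stack([bilinear(F, project(voxels, K, R, t))
                         for F, K, R, t in zip(feats, Ks, Rs, ts)])  # (N, P, C)
    return np.concatenate([per_view.mean(axis=0), per_view.var(axis=0)], axis=-1)
```

The variance channel is close to zero where all views agree on a feature, which is what makes it a useful matching cue for the later 3D network.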

Repeating the above process on the features and volumes of all $L$ scales yields the multi-scale cost volumes $\{B_j\}_{j=0}^{L-1}$. Next, we design an efficient multi-scale 3D network $\psi$ to refine these cost volumes in one forward pass, starting from the finest volume and injecting the others into different stages of the model to save memory. The output of the 3D network, $\{V_j\}_{j=0}^{L-1} = \psi(\{B_j\}_{j=0}^{L-1})$, is the multi-scale volume from which we infer the geometry.

Geometry reasoning.

For an arbitrary 3D point $p = (x, y, z)$, we first interpolate the volumes at all scales, $\{V_j(p)\}_{j=0}^{L-1}$, through trilinear sampling, and then concatenate them as the final feature $\mathcal{F}(p) \in \mathbb{R}^{Ch_1}$, where $Ch_1 = L \times Ch$. Combining the feature and the point position, an MLP network is applied to predict the corresponding SDF value: $sdf_\theta: \mathbb{R}^3 \times \mathbb{R}^{Ch_1} \rightarrow \mathbb{R}$. The surface is represented by the zero-level set of the SDF:

$$S = \{ p \in \mathbb{R}^3 \mid sdf_\theta(p, \mathcal{F}(p)) = 0 \}. \tag{2}$$
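
As a rough illustration of how the per-point feature $\mathcal{F}(p)$ could be assembled, the NumPy sketch below samples each volume trilinearly and concatenates the results. The function names `trilinear` and `multi_scale_feature` are hypothetical, points are assumed to be normalized to $[0, 1]^3$ inside the bounding box, and the SDF MLP itself is left out.

```python
import numpy as np

def trilinear(vol, p):
    """Sample a (C, D, H, W) volume at points p of shape (P, 3) in [0, 1]^3.

    Points are ordered (x, y, z) and map to the (W, H, D) axes; each axis is
    assumed to have at least two voxels.
    """
    C, D, H, W = vol.shape
    size = np.array([W, H, D])
    idx = np.clip(p, 0.0, 1.0) * (size - 1)                # continuous voxel indices
    lo = np.clip(np.floor(idx).astype(int), 0, size - 2)   # keep lo + 1 in range
    a = idx - lo                                           # fractional offsets (P, 3)
    out = np.zeros((C, p.shape[0]))
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                wx = a[:, 0] if dx else 1.0 - a[:, 0]
                wy = a[:, 1] if dy else 1.0 - a[:, 1]
                wz = a[:, 2] if dz else 1.0 - a[:, 2]
                corner = vol[:, lo[:, 2] + dz, lo[:, 1] + dy, lo[:, 0] + dx]
                out += wx * wy * wz * corner               # accumulate (C, P)
    return out.T                                           # (P, C)

def multi_scale_feature(volumes, p):
    """Concatenate samples from all L volumes into a feature of size L * Ch."""
    return np.concatenate([trilinear(V, p) for V in volumes], axis=-1)
```

Because every volume covers the same bounding box, the same normalized point can be used to query all scales; only the voxel spacing differs.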

Color prediction.

We refer to the first viewpoint $I_0$ as the reference image. To predict the color of each point on a ray, we employ a blending strategy similar to wang2021ibrnet . We first project the 3D point $p$ to the pixel positions in the source views according to Eq. 1, and interpolate the corresponding colors $\{I_i(p)\}_{i=1}^{N-1}$ and features $\{F_i(p)\}_{i=1}^{N-1}$. Here, we only use the highest-resolution features to predict the color. Next, an MLP network takes the concatenation of the features and the viewing direction differences $\Delta d = d - d_i$ as input to predict the softmax-activated blending weights $\{w_i(p)\}_{i=1}^{N-1}$ of each source view, and the final color is blended as the weighted sum of the source colors:

$$c(p) = \sum_{i=1}^{N-1} I_i(p) w_i(p). \tag{3}$$
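
The blending step of Eq. (3) reduces to a softmax over source views followed by a weighted sum. The sketch below assumes the MLP has already produced one raw score per source view and point (`logits`, a stand-in for the network output); the function name `blend_colors` is hypothetical.

```python
import numpy as np

def blend_colors(colors, logits):
    """Blend source-view colors with softmax weights, following Eq. (3).

    colors: (S, P, 3) interpolated colors I_i(p) from S source views.
    logits: (S, P) raw per-view scores (stand-in for the MLP output).
    """
    z = logits - logits.max(axis=0, keepdims=True)   # stabilize the softmax
    w = np.exp(z)
    w /= w.sum(axis=0, keepdims=True)                # weights sum to 1 over views
    return (w[..., None] * colors).sum(axis=0)       # (P, 3)
```

Since the weights are a convex combination, the blended color always stays within the range spanned by the source colors.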

SDF-based volume rendering.

Given the densities $\{\sigma_i\}_{i=1}^{M}$ and colors $\{c_i\}_{i=1}^{M}$ of $M$ samples along the ray $p(t) = o + td$ emitted from the camera center $o$ through pixel $q$ in view direction $d$, NeRF mildenhall2020nerf approximates the color using numerical quadrature:

$$\hat{C} = \sum_{i=1}^{M} T_i \alpha_i c_i, \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j), \tag{4}$$

where $T_i$ is the accumulated transmittance, and $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ in the original volume rendering, with $\delta_i$ the distance between adjacent samples. To better approximate the geometry of the scene, NeuS wang2021neus proposed an unbiased and occlusion-aware weighting method that incorporates the signed distance, and $\alpha_i$ is formulated as:

$$\alpha_i = \max\left( \frac{\Phi_s(sdf(p(t_i))) - \Phi_s(sdf(p(t_{i+1})))}{\Phi_s(sdf(p(t_i)))}, 0 \right). \tag{5}$$

Here, $\Phi_s(x) = (1 + e^{-sx})^{-1}$ is the sigmoid function and $s$ is an annealing factor. Readers can refer to wang2021neus for more details.
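
To make Eqs. (4) and (5) concrete, the following NumPy sketch computes the SDF-based opacities for a single ray and composites the colors. It is an illustration under simple assumptions: the SDF is given at $M + 1$ consecutive samples so that $M$ opacities result, $s$ is a fixed scalar, and the function names are hypothetical.

```python
import numpy as np

def sigmoid_s(x, s):
    """Phi_s(x) = 1 / (1 + exp(-s x))."""
    return 1.0 / (1.0 + np.exp(-s * x))

def neus_alpha(sdf, s):
    """Eq. (5): opacities from SDF values at M + 1 consecutive ray samples."""
    cdf = sigmoid_s(sdf, s)
    return np.maximum((cdf[:-1] - cdf[1:]) / cdf[:-1], 0.0)   # (M,)

def render(alpha, colors):
    """Eq. (4): composite (M, 3) colors with T_i = prod_{j<i} (1 - alpha_j)."""
    T = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = T * alpha
    return weights @ colors, weights
```

For a ray that crosses the surface once, with the SDF decreasing monotonically from positive to negative, the weights concentrate on the intervals adjacent to the zero crossing, which is the behavior the SDF-based formulation is designed to produce.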

Figure 3: Multi-view aggregation ambiguity. Here, we take two viewpoints as an example. (a) In low-texture regions, sampling points near the surface may get the same aggregation and lack discriminability. (b) The aggregation of points away from the surface is random, which makes it hard to infer accurate geometry, e.g., two sampling points may get the same aggregation even with different SDF values.

3.2 Multi-scale feature-metric consistency

The rendering loss tends to trap the model in sub-optimal solutions, since it only considers a single point and ignores the consistency among multiple viewpoints. To mitigate this problem, a straightforward practice is to project the image patches of multiple views to the estimated surface location based on the local planar assumption and rely on photometric consistency to enforce multi-view consistency. However, we found that this solution works well for per-scene overfitting training darmon2022improving ; geoneus but brings limited benefits to generalization training.

Figure 4: Multi-scale feature space. The feature space is more discriminative than the ordinary image space, and is more likely to find the corresponding point during matching.

Our analysis suggests that the main reason is the failure of photometric consistency, which becomes more problematic in generalization training. As shown in recent self-supervised multi-view stereo methods xu2021digging ; xu2021self ; yang2021self ; qiu2022self , the assumption of photometric consistency is not always valid, and the predicted geometry still has significant holes even in combination with a robust patch similarity such as SSIM wang2004image . Since coordinate-based methods train a separate model for each scene and directly overfit it, they have greater potential to converge to the optimum. In contrast, our generalization model encodes all scenes with one model and requires image features to infer geometry, which makes it rely heavily on the discriminability of the features, e.g., low-texture and reflective regions degrade the results more severely. As shown in Fig. 3 (a), regions violating photometric consistency not only reduce the accuracy of multi-view matching, but also decrease the discriminability of the generalization model's input (we call this aggregation ambiguity), whereas the input of overfitting methods (the 3D coordinate) is always distinct.

To overcome these challenges, we propose the multi-scale feature-metric consistency to measure the consistency between views in a multi-scale feature space, as shown in Fig. 4. This approach has three main advantages. First, learnable features have been shown to be more discriminative than the original image johnson2016perceptual , especially in ambiguous regions such as low-texture and reflective areas. Second, owing to the larger receptive field, multi-scale information is conducive to improving the matching accuracy, and allows the model to be assisted by global information while recovering details. More importantly, the feature discriminability can be continuously self-enhanced in the process of generalization training. The multi-scale feature space can train a powerful model through more accurate matching, and the more powerful model can in turn lead to a more discriminative feature space. The enhanced features can further mitigate the aforementioned aggregation ambiguity. These advantages are verified in Tab. 3.

Figure 5: Locating the surface of a ray.

To generate the geometry, we adopt the same approximate method as geoneus to directly locate the zero-level set of the SDF. As shown in Fig. 5, we first find the interval where a ray intersects the surface by checking whether the signs of the SDFs of two adjacent sampling points are different. To handle occlusion, we only extract the surface within the first such interval. Suppose the two samples of the interval are $p_1$ and $p_2$, and their distances to the camera center are $t_1$ and $t_2$ respectively; our goal is to compute the position of the surface point $p_s$. Here, we rely on the assumption that two adjacent samples are close enough that the nearby surface can be regarded as a local plane. In this way, we can get two similar triangles:

$$\triangle p_2 p_s s_2 \backsim \triangle p_1 p_s s_1. \tag{6}$$

Therefore, we can approximate the distance from the surface to the camera center $t_s$ as:

$$\frac{-sdf(p_2)}{sdf(p_1)} = \frac{t_2 - t_s}{t_s - t_1} \ \Rightarrow \ t_s = \frac{sdf(p_1) t_2 - sdf(p_2) t_1}{sdf(p_1) - sdf(p_2)}. \tag{7}$$

We can thus obtain the coordinate of the surface point $p_s = o + t_s d$.
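The zero-crossing search and the interpolation of Eq. 7 can be sketched for a single ray as follows. This is only an illustrative NumPy sketch, not the released implementation: the function names `first_surface_depth` and `surface_point` are hypothetical, and the samples are assumed to be sorted by increasing distance from the camera center.

```python
import numpy as np

def first_surface_depth(t, sdf):
    """Return the depth t_s of the first SDF sign change along a ray, or None.

    t   : (S,) increasing distances of the samples from the camera center
    sdf : (S,) SDF values at those samples
    """
    t = np.asarray(t, dtype=float)
    sdf = np.asarray(sdf, dtype=float)
    # Adjacent samples whose SDF values have opposite signs bracket a surface.
    crossing = sdf[:-1] * sdf[1:] < 0
    if not crossing.any():
        return None  # the ray never crosses the zero-level set
    k = int(np.argmax(crossing))  # keep only the first interval (occlusion)
    s1, s2 = sdf[k], sdf[k + 1]
    t1, t2 = t[k], t[k + 1]
    # Eq. 7: linear interpolation of the zero crossing between t1 and t2.
    return (s1 * t2 - s2 * t1) / (s1 - s2)

def surface_point(o, d, t_s):
    """p_s = o + t_s * d for ray origin o and ray direction d."""
    return np.asarray(o, dtype=float) + t_s * np.asarray(d, dtype=float)
```

For example, samples at depths 0, 1, 2, 3 with SDF values 1.5, 0.5, -0.5, -1.5 change sign between the second and third sample, and the interpolation places the surface halfway between them.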

Through automatic differentiation of the SDF network at $p_s$, we obtain the corresponding normal $n_s$. Assuming that the local surface centered at $p_s$ is a plane with normal $n_s$, we can find the pixel position $q_i$ in the $i$-th source view that corresponds to the pixel $q_0$ in the reference view:

$$q_i = H_i q_0, \quad H_i = K_i \left( R_i R_0^T + \frac{R_i (R_i^T t_i - R_0^T t_0)\, n_s^T}{n_s^T p_s} \right) K_0^{-1}. \tag{8}$$
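As an illustration, Eq. 8 can be written in a few lines of NumPy. The sketch below rests on conventions that the text does not spell out and that are therefore assumptions: extrinsics map world to camera coordinates as $x_{cam} = R X + t$, and the plane normal $n_s$ and point $p_s$ are expressed in the reference camera frame (quantities given in world coordinates would first have to be transformed accordingly). The function names are hypothetical.

```python
import numpy as np

def plane_homography(K0, R0, t0, Ki, Ri, ti, n_s, p_s):
    """Plane-induced homography of Eq. 8 from the reference view to view i.

    Assumed conventions (not stated in the text):
      * x_cam = R @ X_world + t for both cameras,
      * n_s and p_s are given in the reference camera frame.
    """
    n_s = np.asarray(n_s, dtype=float)
    p_s = np.asarray(p_s, dtype=float)
    # R_i (R_i^T t_i - R_0^T t_0): translation term of the relative pose.
    rel_t = Ri @ (Ri.T @ ti - R0.T @ t0)
    plane_term = np.outer(rel_t, n_s) / np.dot(n_s, p_s)
    return Ki @ (Ri @ R0.T + plane_term) @ np.linalg.inv(K0)

def warp_pixels(H, q0):
    """Apply H to (P, 2) pixel coordinates and dehomogenize."""
    q0 = np.asarray(q0, dtype=float)
    q_h = np.hstack([q0, np.ones((len(q0), 1))]) @ H.T
    return q_h[:, :2] / q_h[:, 2:3]
```

Passing every pixel of a reference patch through `warp_pixels` gives the corresponding source patch, which is what the patch-level consistency measure operates on.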

For a pixel patch $\mathbf{q}_0$ in the reference view, we can find the corresponding source patch by passing all of its pixels through Eq. 8, i.e., $\mathbf{q}_i = H_i \mathbf{q}_0$. Ignoring occlusion, if the estimated surface point $p_s$ is accurate, these corresponding patches should be consistent. In this paper, we measure patch consistency in a multi-scale feature space. We only use features at the top 3 scales, since features at lower scales lose a lot of structural information. Therefore, for a pixel patch in a given view, we obtain the multi-scale patches $\{F_j\langle\mathbf{q}\rangle\}_{j=0}^{2}$ through bilinear interpolation, and we upsample and concatenate them as the input $F'$, which has $Ch_2 = 3 \times Ch$ channels, for the patch similarity measure. Here, we employ the normalized cross-correlation (NCC) to compute the feature-space consistency:

$$NCC_i = \frac{1}{Ch_2} \sum_{l=0}^{Ch_2 - 1} \frac{Cov(F'_{0l}, F'_{il})}{\sqrt{Var(F'_{0l})\, Var(F'_{il})}}, \tag{9}$$

where $Cov$ denotes covariance and $Var$ denotes variance. Following common practice in the multi-view stereo field galliani2015massively , we compute the final multi-scale feature-space consistency loss as the average over the best $K$ NCCs:

$$L_{mfc} = \frac{1}{K} \sum_{k=0}^{K-1} (1 - NCC_k). \tag{10}$$
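Eqs. 9 and 10 can be sketched as below for one reference patch and a list of warped source patches. This is a hedged illustration rather than the actual training code: the array layout `(Ch2, P)` (channels by patch pixels), the function names, and the small `eps` added for numerical stability are assumptions, and "best" is taken to mean the highest NCC scores.

```python
import numpy as np

def feature_ncc(F0, Fi, eps=1e-8):
    """Eq. 9: channel-averaged NCC between two feature patches.

    F0, Fi : (Ch2, P) arrays, one row per feature channel and one column
             per patch pixel. eps is an added stabilizer (assumption).
    """
    F0c = F0 - F0.mean(axis=1, keepdims=True)
    Fic = Fi - Fi.mean(axis=1, keepdims=True)
    cov = (F0c * Fic).mean(axis=1)    # per-channel covariance
    var0 = (F0c ** 2).mean(axis=1)    # per-channel variances
    vari = (Fic ** 2).mean(axis=1)
    return float(np.mean(cov / np.sqrt(var0 * vari + eps)))

def mfc_loss(F0, F_src, K):
    """Eq. 10: average of (1 - NCC) over the K highest-scoring source views."""
    nccs = np.array([feature_ncc(F0, Fi) for Fi in F_src])
    best = np.sort(nccs)[::-1][:K]    # K largest NCC values
    return float(np.mean(1.0 - best))
```

Keeping only the best $K$ scores means that source views in which the patch is occluded, and which therefore score poorly, contribute nothing to the loss.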

3.3 View contrast loss

For a 3D structure captured from multiple viewpoints, some regions are covered by enough viewpoints, while others are visible from only a few. Compared with the former, the aggregated features of the latter are more likely to be polluted by irrelevant rays, making them harder to predict. To address this problem, we design a view contrast loss that improves reconstruction accuracy when visible views are limited by enforcing the geometric estimates to be the same under different inputs of the same scene.

Figure 6: Visualization of view contrast loss.

We empirically let the results from dense inputs supervise the results from sparse inputs. Specifically, taking a set of multi-view images as input, we first reconstruct a multi-scale volume as a teacher, which is used to infer the finer SDF value $s$ for a set of 3D points $P$. Then we build a student multi-scale volume from sparse input views and estimate the corresponding SDF value $s'$. Meanwhile, as shown in Fig. 3 (b), we found that only the sampling points falling on the surface have positive epipolar correspondences, so their aggregated features are more meaningful, while other samples are more random and may obtain the same aggregation even if their SDF values differ. As shown in Fig. 6, we thus only calculate the consistency loss for near-surface points, whose finer SDF values are more accurate:

$$L_{vc} = \frac{1}{|P'|} \sum_{p \in P'} |s(p) - s'(p)|, \tag{11}$$

where $P'$ is the set of points close to the surface, inferred from the fine SDF according to Eq. 7.
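A minimal sketch of Eq. 11 is given below. It assumes that the teacher and student SDF values have already been evaluated at the same 3D points and that a boolean mask marking the near-surface subset $P'$ is available; the function name and the mask argument are hypothetical. In an automatic-differentiation framework, the teacher values would presumably be treated as constants so that only the student receives gradients.

```python
import numpy as np

def view_contrast_loss(s_teacher, s_student, near_surface):
    """Eq. 11: mean absolute SDF difference over near-surface points.

    s_teacher    : (N,) SDF values from the dense-input (teacher) volume
    s_student    : (N,) SDF values from the sparse-input (student) volume
    near_surface : (N,) boolean mask selecting the subset P'
    """
    s_teacher = np.asarray(s_teacher, dtype=float)
    s_student = np.asarray(s_student, dtype=float)
    mask = np.asarray(near_surface, dtype=bool)
    if not mask.any():
        return 0.0  # no near-surface points in this batch
    return float(np.abs(s_teacher[mask] - s_student[mask]).mean())
```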

3.4 Overall pipeline

This section will introduce some implementation details and crucial components of our model including generalization training and fine-tuning.

Loss function.

The overall loss function is defined as:

$$L = L_{color} + \alpha L_{mfc} + \beta L_{vc} + L_{reg}. \tag{12}$$

For a batch of sampled pixels $Q$, the color loss is computed as the L1 distance between the rendered color and the ground truth:

$$L_{color} = \frac{1}{|Q|} \sum_{q \in Q} |C(q) - \hat{C}(q)|. \tag{13}$$

To make the geometry more compact and accurate, we apply the regularization loss which is composed of four terms:

$$L_{reg} = \gamma L_{ek} + \eta L_{smooth} + \lambda L_{tv} + \delta L_{sparse}. \tag{14}$$

The Eikonal loss gropp2020implicit is employed to regularize the SDF values of all sampled points $P$:

$$L_{ek} = \frac{1}{|P|} \sum_{p \in P} (\|\nabla sdf(p)\|_2 - 1)^2. \tag{15}$$
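The following sketch evaluates Eq. 15 on a toy SDF. In the model the gradient comes from automatic differentiation of the SDF network; here central finite differences stand in for it so that the example runs with NumPy alone. Both function names and the step size `h` are assumptions made for illustration.

```python
import numpy as np

def eikonal_loss(grad_sdf):
    """Eq. 15: mean squared deviation of the SDF gradient norm from 1."""
    norms = np.linalg.norm(grad_sdf, axis=-1)
    return float(np.mean((norms - 1.0) ** 2))

def numerical_gradient(sdf_fn, points, h=1e-4):
    """Central finite differences, a stand-in for automatic differentiation."""
    points = np.asarray(points, dtype=float)
    grads = np.zeros_like(points)
    for axis in range(points.shape[1]):
        step = np.zeros(points.shape[1])
        step[axis] = h
        grads[:, axis] = (sdf_fn(points + step) - sdf_fn(points - step)) / (2 * h)
    return grads

# A unit sphere is a true SDF, so its Eikonal loss is close to zero.
sphere = lambda p: np.linalg.norm(p, axis=-1) - 1.0
pts = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 0.2]])
loss_sphere = eikonal_loss(numerical_gradient(sphere, pts))
```

Scaling the sphere function by 2 doubles the gradient norm everywhere, so the loss rises to about 1, which is the kind of deviation this term penalizes.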

To keep the surface smooth, we regularize the gradient of the normal:

$$L_{smooth} = \frac{1}{|Q|} \sum_{q \in Q} \|n_{grad}(q)\|_2, \tag{16}$$

where $n_{grad}(q)$ is the alpha composition of the normal gradient $\nabla^2 sdf(q)$ along the ray through pixel $q$. In addition, we adopt the total variation (TV) regularization rudin1994total for our multi-scale volumes:

$$L_{tv} = \sum_{j=0}^{L-1} \sqrt{\Delta_x^2(V_j) + \Delta_y^2(V_j) + \Delta_z^2(V_j)}. \tag{17}$$
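One possible reading of Eq. 17 is sketched below. The text does not define $\Delta^2$ precisely, so the sketch assumes that $\Delta_x^2(V_j)$ is the mean squared forward difference between neighboring voxels along the $x$ axis (and likewise for $y$ and $z$), with the square root taken once per volume. The `(C, X, Y, Z)` layout and the function name are also assumptions.

```python
import numpy as np

def tv_loss(volumes):
    """Eq. 17 under the stated assumptions, summed over all scales.

    volumes : list of arrays with shape (C, X, Y, Z), one per scale;
              each spatial axis needs at least two voxels.
    """
    total = 0.0
    for V in volumes:
        V = np.asarray(V, dtype=float)
        dx2 = np.mean(np.diff(V, axis=1) ** 2)  # squared differences along x
        dy2 = np.mean(np.diff(V, axis=2) ** 2)  # along y
        dz2 = np.mean(np.diff(V, axis=3) ** 2)  # along z
        total += np.sqrt(dx2 + dy2 + dz2)
    return float(total)
```

Under this reading a constant volume has zero loss, and a volume that increases by one unit per voxel along a single axis contributes exactly 1.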

To clean the geometric estimation, we introduce a sparsity prior:

$$L_{sparse} = \frac{1}{|P|} \sum_{p \in P} \exp(-\tau |sdf(p)|). \tag{18}$$
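Eq. 18 is short enough to state directly in code. The sketch below is illustrative only: the function name is hypothetical and the value of $\tau$ used in the example is arbitrary, since it is not given in this section.

```python
import numpy as np

def sparsity_loss(sdf_values, tau):
    """Eq. 18: mean of exp(-tau * |sdf|) over the sampled points."""
    sdf_values = np.asarray(sdf_values, dtype=float)
    return float(np.mean(np.exp(-tau * np.abs(sdf_values))))

# Points with near-zero SDF dominate the loss; distant points contribute little.
example = sparsity_loss([0.0, 0.5], tau=2.0)  # (1 + exp(-1)) / 2
```

Each point contributes a value between 0 and 1 that is largest when its SDF is zero, so minimizing the term pushes SDF values away from zero wherever the other losses do not require a surface.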

Generalization training.

We select $N=4$ for the sparse setting and $N=19$ for the dense setting. We use the Adam optimizer kingma2014adam with a base learning rate of 1e-3 for the feature network and 5e-4 for the other MLPs. We train with the joint loss for 16 epochs on two A100 GPUs. We increase the value of $\alpha$ from 0 to 1 in the first 2 epochs. In our implementation, we first generate the surface points $P'$ of each image with the model trained on dense input, and then distill the model with sparse input, with $\beta$ set to 1. We build the generalized multi-scale volume with 5 scales, whose resolutions increase from $2^4$ to $2^8$. Each volume carries thin features with only 4 channels, which saves memory compared with general single-volume methods.
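The memory argument can be checked with a short calculation. The sketch assumes cubic volumes whose per-axis resolutions are $2^4, 2^5, \dots, 2^8$ (five scales) with 4 channels each, and compares them against the single volume with 32 channels at $192^3$ that is mentioned in the ablation discussion; only stored feature values are counted, not network weights.

```python
def voxel_feature_count(resolution, channels):
    """Number of stored feature values in a cubic volume."""
    return channels * resolution ** 3

# Five scales with per-axis resolutions 16, 32, 64, 128, 256 and 4 channels.
multi_scale = sum(voxel_feature_count(2 ** e, 4) for e in range(4, 9))

# Single-volume reference: 32 channels at a resolution of 192^3.
single = voxel_feature_count(192, 32)

ratio = multi_scale / single
print(multi_scale, single, round(ratio, 3))  # 76693504 226492416 0.339
```

Under these assumptions the multi-scale volume stores roughly a third of the feature values of the single volume, even though its finest scale has a higher spatial resolution.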

Fine-tuning.

After generalization training, we first reconstruct the generalized multi-scale volume, which already encodes the geometry information. Then we sparsify the multi-scale volume by pruning voxels far from the surface. During fine-tuning, we discard the feature network and directly optimize the multi-scale volume and the MLPs. With the generalization prior, we can achieve state-of-the-art performance in only about 20 minutes of fine-tuning.

4 Experiments

We demonstrate the state-of-the-art performance of GenS with comprehensive experiments and verify the effectiveness of each module through ablation studies. We first introduce the datasets and then analyze our results.

Datasets.

We conduct experiments on both the DTU jensen2014large and BlendedMVS yao2020blendedmvs datasets, as in previous methods wang2021neus ; yariv2021volume ; long2022sparseneus . Our generalization model is trained on the DTU dataset, an indoor MVS dataset with 124 different scenes scanned from 49 or 64 views with fixed camera trajectories. Following yariv2020multiview ; wang2021neus ; long2022sparseneus , we take the same 15 scenes for testing. The training set is defined as in yao2018mvsnet ; peng2022rethinking , with the test scenes removed. We also evaluate our model on BlendedMVS, a large-scale synthetic dataset. Each scene is scanned from a different number of views, and all images have a resolution of $768 \times 576$. We report the Chamfer Distance for DTU and show qualitative results for BlendedMVS.
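For readers unfamiliar with the metric, a simplified Chamfer Distance between two point sets can be sketched as below. This is not the official DTU evaluation protocol, which involves additional steps that are not reproduced here; the brute-force pairwise computation and the averaging of the two directed distances are simplifying assumptions for illustration.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Average of the two directed mean nearest-neighbor distances.

    pred : (N, 3) points sampled from the reconstructed surface
    gt   : (M, 3) reference points
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # (N, M) matrix of pairwise Euclidean distances (brute force).
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    accuracy = d.min(axis=1).mean()      # prediction -> reference
    completeness = d.min(axis=0).mean()  # reference -> prediction
    return float(0.5 * (accuracy + completeness))
```

Lower values are better: the first term penalizes reconstructed points that lie far from the reference, and the second penalizes reference regions that the reconstruction misses.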

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VolRecon* ren2022volrecon | 1.20 | 2.59 | 1.56 | 1.08 | 1.43 | 1.92 | 1.11 | 1.48 | 1.42 | 1.05 | 1.19 | 1.38 | 0.74 | 1.23 | 1.27 | 1.38 |
| PixelNerf yu2021pixelnerf | 5.13 | 8.07 | 5.85 | 4.40 | 7.11 | 4.64 | 5.68 | 6.76 | 9.05 | 6.11 | 3.95 | 5.92 | 6.26 | 6.89 | 6.93 | 6.28 |
| IBRNet wang2021ibrnet | 2.29 | 3.70 | 2.66 | 1.83 | 3.02 | 2.83 | 1.77 | 2.28 | 2.73 | 1.96 | 1.87 | 2.13 | 1.58 | 2.05 | 2.09 | 2.32 |
| MVSNerf chen2021mvsnerf | 1.96 | 3.27 | 2.54 | 1.93 | 2.57 | 2.71 | 1.82 | 1.72 | 2.29 | 1.75 | 1.72 | 1.47 | 1.29 | 2.09 | 2.26 | 2.09 |
| SparseNeuS long2022sparseneus | 1.68 | 3.06 | 2.25 | 1.10 | 2.37 | 2.18 | 1.28 | 1.47 | 1.80 | 1.23 | 1.19 | 1.17 | 0.75 | 1.56 | 1.55 | 1.64 |
| GenS | 1.45 | 2.77 | 1.69 | 0.97 | 1.54 | 1.90 | 1.03 | 1.49 | 1.36 | 0.97 | 1.07 | 0.97 | 0.62 | 1.14 | 1.16 | 1.34 |
| NeuS wang2021neus | 4.57 | 4.49 | 3.97 | 4.32 | 4.63 | 1.95 | 4.68 | 3.83 | 4.15 | 2.50 | 1.52 | 6.47 | 1.26 | 5.57 | 6.11 | 4.00 |
| VolSDF yariv2021volume | 4.03 | 4.21 | 6.12 | 0.91 | 8.24 | 1.73 | 2.74 | 1.82 | 5.14 | 3.09 | 2.08 | 4.81 | 0.60 | 3.51 | 2.18 | 3.41 |
| IBRNet (ft) | 1.67 | 2.97 | 2.26 | 1.56 | 2.52 | 2.30 | 1.50 | 2.05 | 2.02 | 1.73 | 1.66 | 1.63 | 1.17 | 1.84 | 1.61 | 1.90 |
| COLMAP schonberger2016pixelwise | 0.90 | 2.89 | 1.63 | 1.08 | 2.18 | 1.94 | 1.61 | 1.30 | 2.34 | 1.28 | 1.10 | 1.42 | 0.76 | 1.17 | 1.14 | 1.52 |
| SparseNeuS (ft) | 1.29 | 2.27 | 1.57 | 0.88 | 1.61 | 1.86 | 1.06 | 1.27 | 1.42 | 1.07 | 0.99 | 0.87 | 0.54 | 1.15 | 1.18 | 1.27 |
| GenS (ft) | 0.91 | 2.33 | 1.46 | 0.75 | 1.02 | 1.58 | 0.74 | 1.16 | 1.05 | 0.77 | 0.88 | 0.56 | 0.49 | 0.78 | 0.93 | 1.03 |
Table 1: Quantitative results of Chamfer Distance on DTU dataset with sparse inputs. ‘*’ denotes that the method needs the ground-truth depth for supervision.

4.1 Comparisons

Results on DTU.

We first adopt the same testing split and configuration as long2022sparseneus to compare with existing generalizable methods chen2021mvsnerf ; wang2021ibrnet ; long2022sparseneus ; yu2021pixelnerf . The results in Tab. 1 indicate that our model outperforms existing methods by a significant margin, and this advantage is amplified after rapid fine-tuning (about 20 minutes). Even compared with the recent method ren2022volrecon , which uses ground-truth depth for supervision, our model achieves superior results. The qualitative results in Fig. 1 show that our reconstruction exhibits finer details. We further conducted experiments on DTU to compare with per-scene overfitting methods that use more input views. The quantitative comparisons in Tab. 2 show that our model can surpass some methods yariv2020multiview ; oechsle2021unisurf ; wang2021neus ; yariv2021volume with only a very fast network inference, e.g., a 34% improvement on scene 24 compared with wang2021neus (0.66 vs. 1.00). After a fast fine-tuning, the performance improves significantly and even surpasses recent SOTA works darmon2022improving ; geoneus ; wang2022hf . The visualization results in Fig. 7 show that our model, trained on large amounts of data, is more robust in ambiguous regions.

Generalizing to BlendedMVS.

Like long2022sparseneus , we conduct additional verification on BlendedMVS to showcase the generalization ability of our model. We also employ the same evaluation strategy as long2022sparseneus for a fair comparison. As shown in Fig. 1, with only a fast network inference, our model recovers more geometric details than SparseNeuS long2022sparseneus . After fine-tuning for 5k iterations, the results improve significantly and are better than those of SparseNeuS fine-tuned for 10k iterations.

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IDR yariv2020multiview | 1.63 | 1.87 | 0.63 | 0.48 | 1.04 | 0.79 | 0.77 | 1.33 | 1.16 | 0.76 | 0.67 | 0.90 | 0.42 | 0.51 | 0.53 | 0.90 |
| MVSDF zhang2021learning | 0.83 | 1.76 | 0.88 | 0.44 | 1.11 | 0.90 | 0.75 | 1.26 | 1.02 | 1.35 | 0.87 | 0.84 | 0.34 | 0.47 | 0.46 | 0.88 |
| COLMAP schonberger2016pixelwise | 0.45 | 0.91 | 0.37 | 0.37 | 0.90 | 1.00 | 0.54 | 1.22 | 1.08 | 0.64 | 0.48 | 0.59 | 0.32 | 0.45 | 0.43 | 0.65 |
| NeRF mildenhall2020nerf | 1.90 | 1.60 | 1.85 | 0.58 | 2.28 | 1.27 | 1.47 | 1.67 | 2.05 | 1.07 | 0.88 | 2.53 | 1.06 | 1.15 | 0.96 | 1.49 |
| UNISURF oechsle2021unisurf | 1.32 | 1.36 | 1.72 | 0.44 | 1.35 | 0.79 | 0.80 | 1.49 | 1.37 | 0.89 | 0.59 | 1.47 | 0.46 | 0.59 | 0.62 | 1.02 |
| VolSDF yariv2021volume | 1.14 | 1.26 | 0.81 | 0.49 | 1.25 | 0.70 | 0.72 | 1.29 | 1.18 | 0.70 | 0.66 | 1.08 | 0.42 | 0.61 | 0.55 | 0.86 |
| NeuS wang2021neus | 1.00 | 1.37 | 0.93 | 0.43 | 1.10 | 0.65 | 0.57 | 1.48 | 1.09 | 0.83 | 0.52 | 1.20 | 0.35 | 0.49 | 0.54 | 0.84 |
| HF-NeuS wang2022hf | 0.76 | 1.32 | 0.70 | 0.39 | 1.06 | 0.63 | 0.63 | 1.15 | 1.12 | 0.80 | 0.52 | 1.22 | 0.33 | 0.49 | 0.50 | 0.77 |
| Voxurf wu2022voxurf | 0.65 | 0.74 | 0.39 | 0.35 | 0.96 | 0.64 | 0.85 | 1.58 | 1.01 | 0.68 | 0.60 | 1.11 | 0.37 | 0.45 | 0.47 | 0.72 |
| NeuralWarp darmon2022improving | 0.49 | 0.71 | 0.38 | 0.38 | 0.79 | 0.81 | 0.82 | 1.20 | 1.06 | 0.68 | 0.66 | 0.74 | 0.41 | 0.63 | 0.51 | 0.68 |
| Geo-NeuS* geoneus | 0.46 | 0.83 | 0.38 | 0.39 | 0.88 | 0.61 | 0.51 | 1.26 | 0.92 | 0.68 | 0.57 | 0.82 | 0.30 | 0.41 | 0.42 | 0.63 |
| GenS | 0.66 | 1.01 | 0.71 | 0.43 | 1.06 | 0.99 | 0.73 | 1.43 | 1.18 | 0.78 | 0.64 | 0.93 | 0.38 | 0.54 | 0.54 | 0.80 |
| GenS (ft) | 0.55 | 0.71 | 0.39 | 0.38 | 0.79 | 0.65 | 0.57 | 1.29 | 0.96 | 0.64 | 0.49 | 0.59 | 0.33 | 0.44 | 0.45 | 0.62 |
Table 2: Quantitative results of Chamfer Distance on DTU dataset with dense inputs. ‘*’ denotes that we retrain the Geo-NeuS without sparse geometric supervision.
Figure 7: Qualitative comparisons with per-scene overfitting methods on DTU dataset with dense inputs.

4.2 Analysis

Patch-similarity Multi-scale Self-enhanced Mean
1.86
1.76
1.73
1.62
Table 3: Some ablation studies on MFC.
Figure 8: Reconstruction from volumes with different resolution.

Ablation studies.

We conduct ablation studies on the DTU dataset to understand how the components of our model contribute to the overall performance. We start with our baseline model SparseNeuS long2022sparseneus and gradually add our contributions. The results in Tab. 4 show that our full model combining all components has the best mean score, while the baseline model, without any of our contributions, performs the worst.

Multi-scale Feature-metric Consistency (MFC): Our self-enhanced MFC continuously improves the multi-view consistency of the model, and we report ablation results for its three main characteristics in Tab. 3. The baseline there is the pixel-wise feature consistency proposed in zhang2021learning . It can be seen that our strategy is more robust and efficient.

Generalized Multi-scale Volume (GMV): We show results of models with different volume resolutions in Fig. 8. The reconstruction from a single volume is very noisy at high resolution (a higher-resolution volume contains more empty voxels, which is harder for generalizable models) and overly smooth at low resolution, whereas our GMV reconstructs clean and detailed geometry. Moreover, thanks to its thin features, our representation is lighter than a single volume with a resolution of $32 \times 192^3$.

View Contrast Loss (VCL): As the results in Tab. 1 and Tab. 2 show, the reconstruction with dense inputs is more accurate than the sparse reconstruction; we therefore treat the former as a teacher for the latter. The results in Tab. 4 validate that this strategy indeed improves the reconstruction quality of the model. The ablation results are visualized in Fig. 9, which further shows that each of our contributions consistently improves performance.

Limitation.

Although our model exhibits excellent generalization performance in multi-view reconstruction, we found that it cannot satisfactorily handle scenes with large camera motion, such as surrounding captures, because in these scenarios the aggregated features are polluted by the features of rays shooting from behind. Our current solution is to first predict the local structures covered by a few adjacent viewpoints, as in long2022sparseneus ; yao2018mvsnet ; peng2022rethinking , and then fuse them together.

MFC GMV VCL 24 37 40 55 63 65 69 83 97 105 106 110 114 118 122 Mean
2.26 3.39 2.04 1.27 2.47 2.65 1.62 1.84 1.61 1.32 1.82 1.94 0.91 1.78 1.62 1.90
1.61 3.12 1.99 1.16 2.00 2.21 1.30 1.58 1.45 1.18 1.48 1.53 0.80 1.54 1.43 1.62
1.51 3.07 1.88 0.97 1.56 2.11 1.12 1.45 1.31 0.95 1.20 1.02 0.64 1.32 1.24 1.42
1.45 2.77 1.69 0.97 1.54 1.90 1.03 1.49 1.36 0.97 1.07 0.97 0.62 1.14 1.16 1.34
Table 4: Ablation results on DTU.
Figure 9: Visualization of some ablation results on DTU.

5 Conclusion

In this paper, we introduced GenS, an end-to-end generalizable neural surface reconstruction model. We first encode all scenes into our generalized multi-scale volume, a more powerful representation that can reconstruct clean and detailed 3D structures. Then we introduce the multi-scale feature-metric consistency to address failures of photometric consistency. The learnable multi-scale feature provides a more discriminative representation and can be self-enhanced during generalization training. Finally, we designed a view contrast loss that improves reconstruction accuracy by distilling the finer reconstruction from dense inputs into the reconstruction from sparse inputs. Experimental results on both the DTU and BlendedMVS datasets show that our model possesses stronger generalization ability and can achieve state-of-the-art reconstruction through fast network inference or efficient fine-tuning. In the future, we will focus on improving the performance of the model in difficult scenarios.

Acknowledgments and Disclosure of Funding

This work is financially supported by the National Natural Science Foundation of China (U21B2012 and 62072013), the Shenzhen Science and Technology Program-Shenzhen Cultivation of Excellent Scientific and Technological Innovation Talents project (Grant No. RCJC20200714114435057), and the Shenzhen Science and Technology Program-Shenzhen Hong Kong joint funding project (Grant No. SGDX20211123144400001). This work is also financially supported by the Outstanding Talents Training Fund in Shenzhen. In addition, we thank our collaborators in Alibaba Group and the anonymous reviewers for their valuable comments.

References

  • [1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM TOG, 28(3):24, 2009.
  • [2] Fausto Bernardini, Joshua Mittleman, Holly Rushmeier, Cláudio Silva, and Gabriel Taubin. The ball-pivoting algorithm for surface reconstruction. IEEE TVCG, 5(4):349–359, 1999.
  • [3] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, pages 14124–14133, 2021.
  • [4] François Darmon, Bénédicte Bascle, Jean-Clément Devaux, Pascal Monasse, and Mathieu Aubry. Improving neural implicit surfaces geometry with patch warping. In CVPR, pages 6260–6269, 2022.
  • [5] Qiancheng Fu, Qingshan Xu, Yew-Soon Ong, and Wenbing Tao. Geo-neus: geometry-consistent neural implicit surfaces learning for multi-view reconstruction. NeurIPS, 2022.
  • [6] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE TPAMI, 32(8):1362–1376, 2009.
  • [7] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, pages 873–881, 2015.
  • [8] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In ICCV, pages 7154–7164, 2019.
  • [9] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
  • [10] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR, pages 2495–2504, 2020.
  • [11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In UIST, pages 559–568, 2011.
  • [12] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, pages 406–413, 2014.
  • [13] Mohammad Mahdi Johari, Yann Lepoittevin, and François Fleuret. Geonerf: Generalizing nerf with geometry priors. In CVPR, pages 18365–18375, 2022.
  • [14] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016.
  • [15] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In SGP, volume 7, page 0, 2006.
  • [16] Michael Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM ToG, 32(3):1–13, 2013.
  • [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [18] Kiriakos N Kutulakos and Steven M Seitz. A theory of shape by space carving. IJCV, 38:199–218, 2000.
  • [19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [20] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. NeurIPS, 33:15651–15663, 2020.
  • [21] Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In CVPR, pages 2019–2028, 2020.
  • [22] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. Learning to infer implicit surfaces without 3d supervision. NeurIPS, 32, 2019.
  • [23] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In ECCV, pages 210–227. Springer, 2022.
  • [24] Nelson Max. Optical models for direct volume rendering. IEEE TVCG, 1(2):99–108, 1995.
  • [25] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In CVPR, pages 4460–4470, 2019.
  • [26] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. Implicit surface representations as layers in neural networks. In ICCV, pages 4743–4752, 2019.
  • [27] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, pages 405–421, 2020.
  • [28] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM ToG, 41(4):1–15, 2022.
  • [29] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In ECCV, pages 414–431. Springer, 2020.
  • [30] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4d reconstruction by learning particle dynamics. In ICCV, pages 5379–5389, 2019.
  • [31] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In CVPR, pages 3504–3515, 2020.
  • [32] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM ToG, 32(6):1–11, 2013.
  • [33] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In ICCV, pages 4531–4540, 2019.
  • [34] Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In ICCV, pages 5589–5599, 2021.
  • [35] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, pages 165–174, 2019.
  • [36] Rui Peng, Rongjie Wang, Zhenyu Wang, Yawen Lai, and Ronggang Wang. Rethinking depth estimation for multi-view stereo: A unified representation. In CVPR, pages 8645–8654, 2022.
  • [37] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In ECCV, pages 523–540. Springer, 2020.
  • [38] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In CVPR, pages 10318–10327, 2021.
  • [39] Ke Qiu, Yawen Lai, Shiyi Liu, and Ronggang Wang. Self-supervised multi-view stereo via inter and intra network pseudo depth. In ACMMM, pages 2305–2313, 2022.
  • [40] Yufan Ren, Fangjinhua Wang, Tong Zhang, Marc Pollefeys, and Sabine Süsstrunk. Volrecon: Volume rendering of signed ray distance functions for generalizable multi-view reconstruction. In CVPR, 2023.
  • [41] Leonid I Rudin and Stanley Osher. Total variation based image restoration with free local constraints. In ICIP, volume 1, pages 31–35. IEEE, 1994.
  • [42] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, pages 2304–2314, 2019.
  • [43] Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.
  • [44] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, pages 501–518. Springer, 2016.
  • [45] Steven M Seitz and Charles R Dyer. Photorealistic scene reconstruction by voxel coloring. IJCV, 35:151–173, 1999.
  • [46] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, pages 5459–5469, 2022.
  • [47] Towaki Takikawa, Joey Litalien, Kangxue Yin, Karsten Kreis, Charles Loop, Derek Nowrouzezahrai, Alec Jacobson, Morgan McGuire, and Sanja Fidler. Neural geometric level of detail: Real-time rendering with implicit 3d shapes. In CVPR, pages 11358–11367, 2021.
  • [48] Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Pablo Speciale, and Marc Pollefeys. Patchmatchnet: Learned multi-view patchmatch stereo. In CVPR, pages 14194–14203, 2021.
  • [49] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
  • [50] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In CVPR, pages 4690–4699, 2021.
  • [51] Yiqun Wang, Ivan Skorokhodov, and Peter Wonka. Hf-neus: Improved surface reconstruction using high-frequency details. NeurIPS, 35:1966–1978, 2022.
  • [52] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.
  • [53] Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, and Dahua Lin. Voxurf: Voxel-based efficient and accurate neural surface reconstruction. arXiv preprint arXiv:2208.12697, 2022.
  • [54] Hongbin Xu, Zhipeng Zhou, Yu Qiao, Wenxiong Kang, and Qiuxia Wu. Self-supervised multi-view stereo via effective co-segmentation and data-augmentation. In AAAI, 2021.
  • [55] Hongbin Xu, Zhipeng Zhou, Yali Wang, Wenxiong Kang, Baigui Sun, Hao Li, and Yu Qiao. Digging into uncertainty in self-supervised multi-view stereo. In ICCV, pages 6078–6087, 2021.
  • [56] Qingshan Xu and Wenbing Tao. Multi-scale geometric consistency guided multi-view stereo. In CVPR, pages 5483–5492, 2019.
  • [57] Jiayu Yang, Jose M Alvarez, and Miaomiao Liu. Self-supervised learning of depth inference for multi-view stereo. In CVPR, pages 7526–7534, 2021.
  • [58] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. In ECCV, pages 767–783, 2018.
  • [59] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, pages 1790–1799, 2020.
  • [60] Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. NeurIPS, 34:4805–4815, 2021.
  • [61] Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. NeurIPS, 33:2492–2502, 2020.
  • [62] Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In ICCV, pages 5752–5761, 2021.
  • [63] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In CVPR, pages 4578–4587, 2021.
  • [64] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. In NeurIPS, volume 35, pages 25018–25032, 2022.
  • [65] Jingyang Zhang, Yao Yao, and Long Quan. Learning signed distance field for multi-view surface reconstruction. In ICCV, pages 6525–6534, 2021.
  • [66] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.